\input texinfo @c -*-texinfo-*-
@c %**start of header (This is for running Texinfo on a region.)
@setfilename gawk.info
@settitle The GNU Awk User's Guide
@c %**end of header (This is for running Texinfo on a region.)

@c inside ifinfo for older versions of texinfo.tex
@ifinfo
@c I hope this is the right category
@dircategory Programming Languages
@direntry
* Gawk: (gawk).           A Text Scanning and Processing Language.
@end direntry
@end ifinfo

@c @set xref-automatic-section-title
@c @set DRAFT

@c The following information should be updated here only!
@c This sets the edition of the document, the version of gawk it
@c applies to, and when the document was updated.
@set TITLE Effective AWK Programming
@set SUBTITLE A User's Guide for GNU Awk
@set PATCHLEVEL 6
@set EDITION 1.0.@value{PATCHLEVEL}
@set VERSION 3.0
@set UPDATE-MONTH July, 2000
@iftex
@set DOCUMENT book
@end iftex
@ifinfo
@set DOCUMENT Info file
@end ifinfo

@ignore
Some comments on the layout for TeX.
1. Use at least texinfo.tex 2.159. It contains fixes that
   are needed to get the footings for draft mode to not appear.
2. I have done A LOT of work to make this look good. There are `@page' commands
   and use of `@group ... @end group' in a number of places. If you muck
   with anything, it's your responsibility not to break the layout.
@end ignore

@c merge the function and variable indexes into the concept index
@ifinfo
@synindex fn cp
@synindex vr cp
@end ifinfo
@iftex
@syncodeindex fn cp
@syncodeindex vr cp
@end iftex

@c If "finalout" is commented out, the printed output will show
@c black boxes that mark lines that are too long. Thus, it is
@c unwise to comment it out when running a master in case there are
@c overfulls which are deemed okay.

@ifclear DRAFT
@iftex
@finalout
@end iftex
@end ifclear

@smallbook
@iftex
@c @cropmarks
@end iftex

@ifinfo
This file documents @code{awk}, a program that you can use to select
particular records in a file and perform operations upon them.

This is Edition @value{EDITION} of @cite{@value{TITLE}},
for the @value{VERSION}.@value{PATCHLEVEL} version of the GNU implementation of AWK.

Copyright (C) 1989, 1991, 1992, 1993, 1996-2000 Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

@ignore
Permission is granted to process this file through TeX and print the
results, provided the printed document carries copying permission
notice identical to this one except for the removal of this paragraph
(this paragraph not being relevant to the printed manual).

@end ignore
Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided that the entire
resulting derived work is distributed under the terms of a permission
notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions,
except that this permission notice may be stated in a translation approved
by the Foundation.
@end ifinfo

@setchapternewpage odd

@titlepage
@title @value{TITLE}
@subtitle @value{SUBTITLE}
@subtitle Edition @value{EDITION}
@subtitle @value{UPDATE-MONTH}
@author Arnold D. Robbins
@ignore
@sp 1
@author Based on @cite{The GAWK Manual},
@author by Robbins, Close, Rubin, and Stallman
@end ignore

@c Include the Distribution inside the titlepage environment so
@c that headings are turned off. Headings on and off do not work.

@page
@vskip 0pt plus 1filll
@ifset LEGALJUNK
The programs and applications presented in this book have been
included for their instructional value. They have been tested with care,
but are not guaranteed for any particular purpose. The publisher does not
offer any warranties or representations, nor does it accept any
liabilities with respect to the programs or applications.
So there.
@sp 2
UNIX is a registered trademark of X/Open, Ltd. @*
Microsoft, MS, and MS-DOS are registered trademarks, and Windows is a
trademark of Microsoft Corporation in the United States and other
countries. @*
Atari, 520ST, 1040ST, TT, STE, Mega, and Falcon are registered trademarks
or trademarks of Atari Corporation. @*
DEC, Digital, OpenVMS, ULTRIX, and VMS are trademarks of Digital Equipment
Corporation. @*
@end ifset
``To boldly go where no man has gone before'' is a
Registered Trademark of Paramount Pictures Corporation. @*
@c sorry, i couldn't resist
@sp 3
Copyright @copyright{} 1989, 1991, 1992, 1993, 1996-2000 Free Software Foundation, Inc.
@sp 2

This is Edition @value{EDITION} of @cite{@value{TITLE}}, @*
for the @value{VERSION}.@value{PATCHLEVEL} (or later) version of the GNU implementation of AWK.

@sp 2
Published by:

Free Software Foundation @*
59 Temple Place --- Suite 330 @*
Boston, MA 02111-1307 USA @*
Phone: +1-617-542-5942 @*
Fax: +1-617-542-2652 @*
Email: @code{gnu@@gnu.org} @*
URL: @code{http://www.gnu.org/} @*

@sp 1
@c this ISBN can change!
@c This one is correct for gawk 3.0 and edition 1.0 from the FSF
ISBN 1-882114-26-4 @*

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided that the entire
resulting derived work is distributed under the terms of a permission
notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions,
except that this permission notice may be stated in a translation approved
by the Foundation.
@sp 2
Cover art by Etienne Suvasa.
@end titlepage

@c Thanks to Bob Chassell for directions on doing dedications.
@iftex
@headings off
@page
@w{ }
@sp 9
@center @i{To Miriam, for making me complete.}
@sp 1
@center @i{To Chana, for the joy you bring us.}
@sp 1
@center @i{To Rivka, for the exponential increase.}
@sp 1
@center @i{To Nachum, for the added dimension.}
@sp 1
@center @i{To Malka, for the new beginning.}
@page
@w{ }
@page
@headings on
@end iftex

@iftex
@headings off
@evenheading @thispage@ @ @ @strong{@value{TITLE}} @| @|
@oddheading @| @| @strong{@thischapter}@ @ @ @thispage
@ifset DRAFT
@evenfooting @today{} @| @emph{DRAFT!} @| Please Do Not Redistribute
@oddfooting Please Do Not Redistribute @| @emph{DRAFT!} @| @today{}
@end ifset
@end iftex

@ifinfo
@node Top, Preface, (dir), (dir)
@top General Introduction
@c Preface or Licensing nodes should come right after the Top
@c node, in `unnumbered' sections, then the chapter, `What is gawk'.

This file documents @code{awk}, a program that you can use to select
particular records in a file and perform operations upon them.

This is Edition @value{EDITION} of @cite{@value{TITLE}}, @*
for the @value{VERSION}.@value{PATCHLEVEL} version of the GNU implementation @*
of AWK.

@end ifinfo

@menu
* Preface::                     What this @value{DOCUMENT} is about; brief
                                history and acknowledgements.
* What Is Awk::                 What is the @code{awk} language; using this
                                @value{DOCUMENT}.
* Getting Started::             A basic introduction to using @code{awk}. How
                                to run an @code{awk} program. Command line
                                syntax.
* One-liners::                  Short, sample @code{awk} programs.
* Regexp::                      All about matching things using regular
                                expressions.
* Reading Files::               How to read files and manipulate fields.
* Printing::                    How to print using @code{awk}. Describes the
                                @code{print} and @code{printf} statements.
                                Also describes redirection of output.
* Expressions::                 Expressions are the basic building blocks of
                                statements.
* Patterns and Actions::        Overviews of patterns and actions.
* Statements::                  The various control statements are described
                                in detail.
* Built-in Variables::          Built-in Variables
* Arrays::                      The description and use of arrays. Also
                                includes array-oriented control statements.
* Built-in::                    The built-in functions are summarized here.
* User-defined::                User-defined functions are described in
                                detail.
* Invoking Gawk::               How to run @code{gawk}.
* Library Functions::           A Library of @code{awk} Functions.
* Sample Programs::             Many @code{awk} programs with complete
                                explanations.
* Language History::            The evolution of the @code{awk} language.
* Gawk Summary::                @code{gawk} Options and Language Summary.
* Installation::                Installing @code{gawk} under various operating
                                systems.
* Notes::                       Something about the implementation of
                                @code{gawk}.
* Glossary::                    An explanation of some unfamiliar terms.
* Copying::                     Your right to copy and distribute @code{gawk}.
* Index::                       Concept and Variable Index.

* History::                     The history of @code{gawk} and @code{awk}.
* Manual History::              Brief history of the GNU project and this
                                @value{DOCUMENT}.
* Acknowledgements::            Acknowledgements.
* This Manual::                 Using this @value{DOCUMENT}. Includes sample
                                input files that you can use.
* Conventions::                 Typographical Conventions.
* Sample Data Files::           Sample data files for use in the @code{awk}
                                programs illustrated in this @value{DOCUMENT}.
* Names::                       What name to use to find @code{awk}.
* Running gawk::                How to run @code{gawk} programs; includes
                                command line syntax.
* One-shot::                    Running a short throw-away @code{awk} program.
* Read Terminal::               Using no input files (input from terminal
                                instead).
* Long::                        Putting permanent @code{awk} programs in
                                files.
* Executable Scripts::          Making self-contained @code{awk} programs.
* Comments::                    Adding documentation to @code{gawk} programs.
* Very Simple::                 A very simple example.
* Two Rules::                   A less simple one-line example with two rules.
* More Complex::                A more complex example.
* Statements/Lines::            Subdividing or combining statements into
                                lines.
* Other Features::              Other Features of @code{awk}.
* When::                        When to use @code{gawk} and when to use other
                                things.
* Regexp Usage::                How to Use Regular Expressions.
* Escape Sequences::            How to write non-printing characters.
* Regexp Operators::            Regular Expression Operators.
* GNU Regexp Operators::        Operators specific to GNU software.
* Case-sensitivity::            How to do case-insensitive matching.
* Leftmost Longest::            How much text matches.
* Computed Regexps::            Using Dynamic Regexps.
* Records::                     Controlling how data is split into records.
* Fields::                      An introduction to fields.
* Non-Constant Fields::         Non-constant Field Numbers.
* Changing Fields::             Changing the Contents of a Field.
* Field Separators::            The field separator and how to change it.
* Basic Field Splitting::       How fields are split with single characters or
                                simple strings.
* Regexp Field Splitting::      Using regexps as the field separator.
* Single Character Fields::     Making each character a separate field.
* Command Line Field Separator:: Setting @code{FS} from the command line.
* Field Splitting Summary::     Some final points and a summary table.
* Constant Size::               Reading constant width data.
* Multiple Line::               Reading multi-line records.
* Getline::                     Reading files under explicit program control
                                using the @code{getline} function.
* Getline Intro::               Introduction to the @code{getline} function.
* Plain Getline::               Using @code{getline} with no arguments.
* Getline/Variable::            Using @code{getline} into a variable.
* Getline/File::                Using @code{getline} from a file.
* Getline/Variable/File::       Using @code{getline} into a variable from a
                                file.
* Getline/Pipe::                Using @code{getline} from a pipe.
* Getline/Variable/Pipe::       Using @code{getline} into a variable from a
                                pipe.
* Getline Summary::             Summary Of @code{getline} Variants.
* Print::                       The @code{print} statement.
* Print Examples::              Simple examples of @code{print} statements.
* Output Separators::           The output separators and how to change them.
* OFMT::                        Controlling Numeric Output With @code{print}.
* Printf::                      The @code{printf} statement.
* Basic Printf::                Syntax of the @code{printf} statement.
* Control Letters::             Format-control letters.
* Format Modifiers::            Format-specification modifiers.
* Printf Examples::             Several examples.
* Redirection::                 How to redirect output to multiple files and
                                pipes.
* Special Files::               File name interpretation in @code{gawk}.
                                @code{gawk} allows access to inherited file
                                descriptors.
* Close Files And Pipes::       Closing Input and Output Files and Pipes.
* Constants::                   String, numeric, and regexp constants.
* Scalar Constants::            Numeric and string constants.
* Regexp Constants::            Regular Expression constants.
* Using Constant Regexps::      When and how to use a regexp constant.
* Variables::                   Variables give names to values for later use.
* Using Variables::             Using variables in your programs.
* Assignment Options::          Setting variables on the command line and a
                                summary of command line syntax. This is an
                                advanced method of input.
* Conversion::                  The conversion of strings to numbers and vice
                                versa.
* Arithmetic Ops::              Arithmetic operations (@samp{+}, @samp{-},
                                etc.)
* Concatenation::               Concatenating strings.
* Assignment Ops::              Changing the value of a variable or a field.
* Increment Ops::               Incrementing the numeric value of a variable.
* Truth Values::                What is ``true'' and what is ``false''.
* Typing and Comparison::       How variables acquire types, and how this
                                affects comparison of numbers and strings with
                                @samp{<}, etc.
* Boolean Ops::                 Combining comparison expressions using boolean
                                operators @samp{||} (``or''), @samp{&&}
                                (``and'') and @samp{!} (``not'').
* Conditional Exp::             Conditional expressions select between two
                                subexpressions under control of a third
                                subexpression.
* Function Calls::              A function call is an expression.
* Precedence::                  How various operators nest.
* Pattern Overview::            What goes into a pattern.
* Kinds of Patterns::           A list of all kinds of patterns.
* Regexp Patterns::             Using regexps as patterns.
* Expression Patterns::         Any expression can be used as a pattern.
* Ranges::                      Pairs of patterns specify record ranges.
* BEGIN/END::                   Specifying initialization and cleanup rules.
* Using BEGIN/END::             How and why to use BEGIN/END rules.
* I/O And BEGIN/END::           I/O issues in BEGIN/END rules.
* Empty::                       The empty pattern, which matches every record.
* Action Overview::             What goes into an action.
* If Statement::                Conditionally execute some @code{awk}
                                statements.
* While Statement::             Loop until some condition is satisfied.
* Do Statement::                Do specified action while looping until some
                                condition is satisfied.
* For Statement::               Another looping statement, that provides
                                initialization and increment clauses.
* Break Statement::             Immediately exit the innermost enclosing loop.
* Continue Statement::          Skip to the end of the innermost enclosing
                                loop.
* Next Statement::              Stop processing the current input record.
* Nextfile Statement::          Stop processing the current file.
* Exit Statement::              Stop execution of @code{awk}.
* User-modified::               Built-in variables that you change to control
                                @code{awk}.
* Auto-set::                    Built-in variables where @code{awk} gives you
                                information.
* ARGC and ARGV::               Ways to use @code{ARGC} and @code{ARGV}.
* Array Intro::                 Introduction to Arrays
* Reference to Elements::       How to examine one element of an array.
* Assigning Elements::          How to change an element of an array.
* Array Example::               Basic Example of an Array
* Scanning an Array::           A variation of the @code{for} statement. It
                                loops through the indices of an array's
                                existing elements.
* Delete::                      The @code{delete} statement removes an element
                                from an array.
* Numeric Array Subscripts::    How to use numbers as subscripts in
                                @code{awk}.
* Uninitialized Subscripts::    Using Uninitialized variables as subscripts.
* Multi-dimensional::           Emulating multi-dimensional arrays in
                                @code{awk}.
* Multi-scanning::              Scanning multi-dimensional arrays.
* Calling Built-in::            How to call built-in functions.
* Numeric Functions::           Functions that work with numbers, including
                                @code{int}, @code{sin} and @code{rand}.
* String Functions::            Functions for string manipulation, such as
                                @code{split}, @code{match}, and
                                @code{sprintf}.
* I/O Functions::               Functions for files and shell commands.
* Time Functions::              Functions for dealing with time stamps.
* Definition Syntax::           How to write definitions and what they mean.
* Function Example::            An example function definition and what it
                                does.
* Function Caveats::            Things to watch out for.
* Return Statement::            Specifying the value a function returns.
* Options::                     Command line options and their meanings.
* Other Arguments::             Input file names and variable assignments.
* AWKPATH Variable::            Searching directories for @code{awk} programs.
* Obsolete::                    Obsolete Options and/or features.
* Undocumented::                Undocumented Options and Features.
* Known Bugs::                  Known Bugs in @code{gawk}.
* Portability Notes::           What to do if you don't have @code{gawk}.
* Nextfile Function::           Two implementations of a @code{nextfile}
                                function.
* Assert Function::             A function for assertions in @code{awk}
                                programs.
* Round Function::              A function for rounding if @code{sprintf} does
                                not do it correctly.
* Ordinal Functions::           Functions for using characters as numbers and
                                vice versa.
* Join Function::               A function to join an array into a string.
* Mktime Function::             A function to turn a date into a timestamp.
* Gettimeofday Function::       A function to get formatted times.
* Filetrans Function::          A function for handling data file transitions.
* Getopt Function::             A function for processing command line
                                arguments.
* Passwd Functions::            Functions for getting user information.
* Group Functions::             Functions for getting group information.
* Library Names::               How to best name private global variables in
                                library functions.
* Clones::                      Clones of common utilities.
* Cut Program::                 The @code{cut} utility.
* Egrep Program::               The @code{egrep} utility.
* Id Program::                  The @code{id} utility.
* Split Program::               The @code{split} utility.
* Tee Program::                 The @code{tee} utility.
* Uniq Program::                The @code{uniq} utility.
* Wc Program::                  The @code{wc} utility.
* Miscellaneous Programs::      Some interesting @code{awk} programs.
* Dupword Program::             Finding duplicated words in a document.
* Alarm Program::               An alarm clock.
* Translate Program::           A program similar to the @code{tr} utility.
* Labels Program::              Printing mailing labels.
* Word Sorting::                A program to produce a word usage count.
* History Sorting::             Eliminating duplicate entries from a history
                                file.
* Extract Program::             Pulling out programs from Texinfo source
                                files.
* Simple Sed::                  A Simple Stream Editor.
* Igawk Program::               A wrapper for @code{awk} that includes files.
* V7/SVR3.1::                   The major changes between V7 and System V
                                Release 3.1.
* SVR4::                        Minor changes between System V Releases 3.1
                                and 4.
* POSIX::                       New features from the POSIX standard.
* BTL::                         New features from the Bell Laboratories
                                version of @code{awk}.
* POSIX/GNU::                   The extensions in @code{gawk} not in POSIX
                                @code{awk}.
* Command Line Summary::        Recapitulation of the command line.
* Language Summary::            A terse review of the language.
* Variables/Fields::            Variables, fields, and arrays.
* Fields Summary::              Input field splitting.
* Built-in Summary::            @code{awk}'s built-in variables.
* Arrays Summary::              Using arrays.
* Data Type Summary::           Values in @code{awk} are numbers or strings.
* Rules Summary::               Patterns and Actions, and their component
                                parts.
* Pattern Summary::             Quick overview of patterns.
* Regexp Summary::              Quick overview of regular expressions.
* Actions Summary::             Quick overview of actions.
* Operator Summary::            @code{awk} operators.
* Control Flow Summary::        The control statements.
* I/O Summary::                 The I/O statements.
* Printf Summary::              A summary of @code{printf}.
* Special File Summary::        Special file names interpreted internally.
* Built-in Functions Summary::  Built-in numeric and string functions.
* Time Functions Summary::      Built-in time functions.
* String Constants Summary::    Escape sequences in strings.
* Functions Summary::           Defining and calling functions.
* Historical Features::         Some undocumented but supported ``features''.
* Gawk Distribution::           What is in the @code{gawk} distribution.
* Getting::                     How to get the distribution.
* Extracting::                  How to extract the distribution.
* Distribution contents::       What is in the distribution.
* Unix Installation::           Installing @code{gawk} under various versions
                                of Unix.
* Quick Installation::          Compiling @code{gawk} under Unix.
* Configuration Philosophy::    How it's all supposed to work.
* VMS Installation::            Installing @code{gawk} on VMS.
* VMS Compilation::             How to compile @code{gawk} under VMS.
* VMS Installation Details::    How to install @code{gawk} under VMS.
* VMS Running::                 How to run @code{gawk} under VMS.
* VMS POSIX::                   Alternate instructions for VMS POSIX.
* PC Installation::             Installing and Compiling @code{gawk} on MS-DOS
                                and OS/2
* Atari Installation::          Installing @code{gawk} on the Atari ST.
* Atari Compiling::             Compiling @code{gawk} on Atari
* Atari Using::                 Running @code{gawk} on Atari
* Amiga Installation::          Installing @code{gawk} on an Amiga.
* Bugs::                        Reporting Problems and Bugs.
* Other Versions::              Other freely available @code{awk}
                                implementations.
* Compatibility Mode::          How to disable certain @code{gawk} extensions.
* Additions::                   Making Additions To @code{gawk}.
* Adding Code::                 Adding code to the main body of @code{gawk}.
* New Ports::                   Porting @code{gawk} to a new operating system.
* Future Extensions::           New features that may be implemented one day.
* Improvements::                Suggestions for improvements by volunteers.

@end menu

@c dedication for Info file
@ifinfo
@center To Miriam, for making me complete.
@sp 1
@center To Chana, for the joy you bring us.
@sp 1
@center To Rivka, for the exponential increase.
@sp 1
@center To Nachum, for the added dimension.
@sp 1
@center To Malka, for the new beginning.
@end ifinfo

@node Preface, What Is Awk, Top, Top
@unnumbered Preface

@c I saw a comment somewhere that the preface should describe the book itself,
@c and the introduction should describe what the book covers.

This @value{DOCUMENT} teaches you about the @code{awk} language and
how you can use it effectively. You should already be familiar with basic
system commands, such as @code{cat} and @code{ls},@footnote{These commands
are available on POSIX-compliant systems, as well as on traditional
Unix-based systems. If you are using some other operating system, you still need to
be familiar with the ideas of I/O redirection and pipes.} and basic shell
facilities, such as Input/Output (I/O) redirection and pipes.

Implementations of the @code{awk} language are available for many different
computing environments. This @value{DOCUMENT}, while describing the @code{awk} language
in general, also describes a particular implementation of @code{awk} called
@code{gawk} (which stands for ``GNU Awk''). @code{gawk} runs on a broad range
of Unix systems, from 80386-based PCs up through large-scale
systems such as Crays. @code{gawk} has also been ported to MS-DOS and
OS/2 PCs, Atari and Amiga microcomputers, and VMS.

@menu
* History::                     The history of @code{gawk} and @code{awk}.
* Manual History::              Brief history of the GNU project and this
                                @value{DOCUMENT}.
* Acknowledgements::            Acknowledgements.
@end menu

@node History, Manual History, Preface, Preface
@unnumberedsec History of @code{awk} and @code{gawk}

@cindex acronym
@cindex history of @code{awk}
@cindex Aho, Alfred
@cindex Weinberger, Peter
@cindex Kernighan, Brian
@cindex old @code{awk}
@cindex new @code{awk}
The name @code{awk} comes from the initials of its designers: Alfred V.@:
Aho, Peter J.@: Weinberger, and Brian W.@: Kernighan. The original version of
@code{awk} was written in 1977 at AT&T Bell Laboratories.
In 1985 a new version made the programming
language more powerful, introducing user-defined functions, multiple input
streams, and computed regular expressions.
This new version became generally available with Unix System V Release 3.1.
The version in System V Release 4 added some new features and also cleaned
up the behavior in some of the ``dark corners'' of the language.
The specification for @code{awk} in the POSIX Command Language
and Utilities standard further clarified the language based on feedback
from both the @code{gawk} designers and the original Bell Labs @code{awk}
designers.

The GNU implementation, @code{gawk}, was written in 1986 by Paul Rubin
and Jay Fenlason, with advice from Richard Stallman. John Woods
contributed parts of the code as well. In 1988 and 1989, David Trueman, with
help from Arnold Robbins, thoroughly reworked @code{gawk} for compatibility
with the newer @code{awk}. Current development focuses on bug fixes,
performance improvements, standards compliance, and, occasionally, new features.

@node Manual History, Acknowledgements, History, Preface
@unnumberedsec The GNU Project and This Book

@cindex Free Software Foundation
@cindex Stallman, Richard
The Free Software Foundation (FSF) is a non-profit organization dedicated
to the production and distribution of freely distributable software.
It was founded by Richard M.@: Stallman, the author of the original
Emacs editor. GNU Emacs is the most widely used version of Emacs today.

@cindex GNU Project
The GNU project is an on-going effort on the part of the Free Software
Foundation to create a complete, freely distributable, POSIX-compliant
computing environment. (GNU stands for ``GNU's not Unix''.)
The FSF uses the ``GNU General Public License'' (or GPL) to ensure that
source code for their software is always available to the end user. A
copy of the GPL is included for your reference
(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).
The GPL applies to the C language source code for @code{gawk}.

A shell, an editor (Emacs), highly portable optimizing C, C++, and
Objective-C compilers, a symbolic debugger, and dozens of large and
small utilities (such as @code{gawk}) have all been completed and are
freely available. As of this writing (early 1997), the GNU operating
system kernel (the HURD) has been released, but is still in an early
stage of development.

@cindex Linux
@cindex NetBSD
@cindex FreeBSD
Until the GNU operating system is more fully developed, you should
consider using Linux, a freely distributable, Unix-like operating
system for 80386, DEC Alpha, Sun SPARC and other systems. There are
many books on Linux. One that is freely available is @cite{Linux
Installation and Getting Started}, by Matt Welsh.
Many Linux distributions are available, often in computer stores or
bundled on CD-ROM with books about Linux.
(There are three other freely available, Unix-like operating systems for
80386 and other systems: NetBSD, FreeBSD, and OpenBSD. All are based on the
4.4-Lite Berkeley Software Distribution, and they use recent versions
of @code{gawk} for their versions of @code{awk}.)

@iftex
This @value{DOCUMENT} you are reading now is actually free. The
information in it is freely available to anyone, the machine-readable
source code for the @value{DOCUMENT} comes with @code{gawk}, and anyone
may take this @value{DOCUMENT} to a copying machine and make as many
copies of it as they like. (Take a moment to check the copying
permissions on the Copyright page.)

If you paid money for this @value{DOCUMENT}, what you actually paid for
was the @value{DOCUMENT}'s nice printing and binding, and the
publisher's associated costs to produce it.
We have made an effort to
keep these costs reasonable; most people would prefer a bound book to
over 330 pages of photo-copied text that would then have to be held in
a loose-leaf binder (not to mention the time and labor involved in
doing the copying). The same is true of producing this
@value{DOCUMENT} from the machine-readable source; the retail price is
only slightly more than the cost per page of printing it
on a laser printer.
@end iftex

This @value{DOCUMENT} itself has gone through several previous,
preliminary editions. I started working on a preliminary draft of
@cite{The GAWK Manual}, by Diane Close, Paul Rubin, and Richard
Stallman in the fall of 1988.
It was around 90 pages long, and barely described the original, ``old''
version of @code{awk}. After substantial revision, the first version of
@cite{The GAWK Manual} to be released was Edition 0.11 Beta in
October of 1989. The manual then underwent more substantial revision
for Edition 0.13 of December 1991.
David Trueman, Pat Rankin, and Michal Jaegermann contributed sections
of the manual for Edition 0.13.
That edition was published by the
FSF as a bound book early in 1992. Since then there have been several
minor revisions, notably Edition 0.14 of November 1992 that was published
by the FSF in January of 1993, and Edition 0.16 of August 1993.

Edition 1.0 of @cite{@value{TITLE}} represents a significant re-working
of @cite{The GAWK Manual}, with much additional material.
The FSF and I agree that I am now the primary author.
I also felt that it needed a more descriptive title.

@cite{@value{TITLE}} will undoubtedly continue to evolve.
An electronic version
comes with the @code{gawk} distribution from the FSF.
If you find an error in this @value{DOCUMENT}, please report it!
@xref{Bugs, ,Reporting Problems and Bugs}, for information on submitting
problem reports electronically, or write to me in care of the FSF.

@node Acknowledgements, , Manual History, Preface
@unnumberedsec Acknowledgements

@cindex Stallman, Richard
I would like to acknowledge Richard M.@: Stallman, for his vision of a
better world, and for his courage in founding the FSF and starting the
GNU project.

The initial draft of @cite{The GAWK Manual} had the following acknowledgements:

@quotation
Many people need to be thanked for their assistance in producing this
manual. Jay Fenlason contributed many ideas and sample programs. Richard
Mlynarik and Robert Chassell gave helpful comments on drafts of this
manual. The paper @cite{A Supplemental Document for @code{awk}} by John W.@:
Pierce of the Chemistry Department at UC San Diego, pinpointed several
issues relevant both to @code{awk} implementation and to this manual, that
would otherwise have escaped us.
@end quotation

The following people provided many helpful comments on Edition 0.13 of
@cite{The GAWK Manual}: Rick Adams, Michael Brennan, Rich Burridge, Diane Close,
Christopher (``Topher'') Eliot, Michael Lijewski, Pat Rankin, Miriam Robbins,
and Michal Jaegermann.

The following people provided many helpful comments for Edition 1.0 of
@cite{@value{TITLE}}: Karl Berry, Michael Brennan, Darrel
Hankerson, Michal Jaegermann, Michael Lijewski, and Miriam Robbins.
Pat Rankin, Michal Jaegermann, Darrel Hankerson and Scott Deifik
updated their respective sections for Edition 1.0.

Robert J.@: Chassell provided much valuable advice on
the use of Texinfo. He also deserves special thanks for
convincing me @emph{not} to title this @value{DOCUMENT}
@cite{How To Gawk Politely}.
Karl Berry helped significantly with the @TeX{} part of Texinfo.

@cindex Trueman, David
David Trueman deserves special credit; he has done a yeoman job
of evolving @code{gawk} so that it performs well, and without bugs.
Although he is no longer involved with @code{gawk},
working with him on this project was a significant pleasure.

@cindex Deifik, Scott
@cindex Hankerson, Darrel
@cindex Rommel, Kai Uwe
@cindex Rankin, Pat
@cindex Jaegermann, Michal
Scott Deifik, Darrel Hankerson, Kai Uwe Rommel, Pat Rankin, and Michal
Jaegermann (in no particular order) are long time members of the
@code{gawk} ``crack portability team.'' Without their hard work and
help, @code{gawk} would not be nearly the fine program it is today. It
has been and continues to be a pleasure working with this team of fine
people.

@cindex Friedl, Jeffrey
Jeffrey Friedl provided invaluable help in tracking down a number
of last minute problems with regular expressions in @code{gawk} 3.0.

@cindex Kernighan, Brian
David and I would like to thank Brian Kernighan of Bell Labs for
invaluable assistance during the testing and debugging of @code{gawk}, and for
help in clarifying numerous points about the language. We could not have
done nearly as good a job on either @code{gawk} or its documentation without
his help.

@cindex Hughes, Phil
I would like to thank Marshall and Elaine Hartholz of Seattle, and Dr.@:
Bert and Rita Schreiber of Detroit for large amounts of quiet vacation
time in their homes, which allowed me to make significant progress on
this @value{DOCUMENT} and on @code{gawk} itself. Phil Hughes of SSC
contributed in a very important way by loaning me his laptop Linux
system, not once, but twice, allowing me to do a lot of work while
away from home.

@cindex Robbins, Miriam
Finally, I must thank my wonderful wife, Miriam, for her patience through
the many versions of this project, for her proof-reading,
and for sharing me with the computer.
I would like to thank my parents for their love, and for the grace with
which they raised and educated me.
I also must acknowledge my gratitude to G-d, for the many opportunities
He has sent my way, as well as for the gifts He has given me with which to
take advantage of those opportunities.
@sp 2
@noindent
Arnold Robbins @*
Atlanta, Georgia @*
February, 1997

@ignore
Stuff still not covered anywhere:
BASICS:
	Integer vs. floating point
	Hex vs. octal vs. decimal
	Interpreter vs compiler
	input/output
@end ignore

@node What Is Awk, Getting Started, Preface, Top
@chapter Introduction

If you are like many computer users, you would frequently like to make
changes in various text files wherever certain patterns appear, or
extract data from parts of certain lines while discarding the rest. To
write a program to do this in a language such as C or Pascal is a
time-consuming inconvenience that may take many lines of code. The job
may be easier with @code{awk}.

The @code{awk} utility interprets a special-purpose programming language
that makes it possible to handle simple data-reformatting jobs
with just a few lines of code.

The GNU implementation of @code{awk} is called @code{gawk}; it is fully
upward compatible with the System V Release 4 version of
@code{awk}. @code{gawk} is also upward compatible with the POSIX
specification of the @code{awk} language. This means that all
properly written @code{awk} programs should work with @code{gawk}.
Thus, we usually don't distinguish between @code{gawk} and other @code{awk}
implementations.

@cindex uses of @code{awk}
Using @code{awk} you can:

@itemize @bullet
@item
manage small, personal databases

@item
generate reports

@item
validate data

@item
produce indexes, and perform other document preparation tasks

@item
even experiment with algorithms that can be adapted later to other computer
languages
@end itemize

@menu
* This Manual::                 Using this @value{DOCUMENT}. Includes sample
                                input files that you can use.
* Conventions::                 Typographical Conventions.
* Sample Data Files::           Sample data files for use in the @code{awk}
                                programs illustrated in this @value{DOCUMENT}.
@end menu

@node This Manual, Conventions, What Is Awk, What Is Awk
@section Using This Book
@cindex book, using this
@cindex using this book
@cindex language, @code{awk}
@cindex program, @code{awk}
@ignore
@cindex @code{awk} language
@cindex @code{awk} program
@end ignore

The term @code{awk} refers to a particular program, and to the language you
use to tell this program what to do. When we need to be careful, we call
the program ``the @code{awk} utility'' and the language ``the @code{awk}
language.'' The term @code{gawk} refers to a version of @code{awk} developed
as part of the GNU project. The purpose of this @value{DOCUMENT} is to explain
both the @code{awk} language and how to run the @code{awk} utility.

The main purpose of the @value{DOCUMENT} is to explain the features
of @code{awk}, as defined in the POSIX standard. It does so in the context
of one particular implementation, @code{gawk}. While doing so, it will also
attempt to describe important differences between @code{gawk} and other
@code{awk} implementations. Finally, any @code{gawk} features that
are not in the POSIX standard for @code{awk} will be noted.

@iftex
This @value{DOCUMENT} has the difficult task of being both tutorial and reference.
If you are a novice, feel free to skip over details that seem too complex.
You should also ignore the many cross references; they are for the
expert user, and for the on-line Info version of the document.
@end iftex

The term @dfn{@code{awk} program} refers to a program written by you in
the @code{awk} programming language.

@xref{Getting Started, ,Getting Started with @code{awk}}, for the bare
essentials you need to know to start using @code{awk}.

Some useful ``one-liners'' are included to give you a feel for the
@code{awk} language (@pxref{One-liners, ,Useful One Line Programs}).

Many sample @code{awk} programs have been provided for you
(@pxref{Library Functions, ,A Library of @code{awk} Functions}; also
@pxref{Sample Programs, ,Practical @code{awk} Programs}).

The entire @code{awk} language is summarized for quick reference in
@ref{Gawk Summary, ,@code{gawk} Summary}. Look there if you just need
to refresh your memory about a particular feature.

If you find terms that you aren't familiar with, try looking them
up in the glossary (@pxref{Glossary}).

Most of the time complete @code{awk} programs are used as examples, but in
some of the more advanced sections, only the part of the @code{awk} program
that illustrates the concept being described is shown.

While this @value{DOCUMENT} is aimed principally at people who have not been
exposed
to @code{awk}, there is a lot of information here that even the @code{awk}
expert should find useful. In particular, the description of POSIX
@code{awk}, and the example programs in
@ref{Library Functions, ,A Library of @code{awk} Functions}, and
@ref{Sample Programs, ,Practical @code{awk} Programs},
should be of interest.

@c fakenode --- for prepinfo
@unnumberedsubsec Dark Corners
@display
@i{Who opened that window shade?!?}
Count Dracula
@end display
@sp 1

@cindex d.c., see ``dark corner''
@cindex dark corner
Until the POSIX standard (and @cite{The GAWK Manual}),
many features of @code{awk} were either poorly documented, or not
documented at all. Descriptions of such features
(often called ``dark corners'') are noted in this @value{DOCUMENT} with
``(d.c.)''.
They also appear in the index under the heading ``dark corner.''

@node Conventions, Sample Data Files, This Manual, What Is Awk
@section Typographical Conventions

This @value{DOCUMENT} is written using Texinfo, the GNU documentation formatting language.
A single Texinfo source file is used to produce both the printed and on-line
versions of the documentation.
@iftex
Because of this, the typographical conventions
are slightly different than in other books you may have read.
@end iftex
@ifinfo
This section briefly documents the typographical conventions used in Texinfo.
@end ifinfo

Examples you would type at the command line are preceded by the common
shell primary and secondary prompts, @samp{$} and @samp{>}.
Output from the command is preceded by the glyph ``@print{}''.
This typically represents the command's standard output.
Error messages, and other output on the command's standard error, are preceded
by the glyph ``@error{}''. For example:

@example
@group
$ echo hi on stdout
@print{} hi on stdout
$ echo hello on stderr 1>&2
@error{} hello on stderr
@end group
@end example

@iftex
In the text, command names appear in @code{this font}, while code segments
appear in the same font and quoted, @samp{like this}. Some things will
be emphasized @emph{like this}, and if a point needs to be made
strongly, it will be done @strong{like this}.
The first occurrence of
a new term is usually its @dfn{definition}, and appears in the same
font as the previous occurrence of ``definition'' in this sentence.
File names are indicated like this: @file{/path/to/ourfile}.
@end iftex

Characters that you type at the keyboard look @kbd{like this}. In particular,
there are special characters called ``control characters.'' These are
characters that you type by holding down both the @kbd{CONTROL} key and
another key, at the same time. For example, a @kbd{Control-d} is typed
by first pressing and holding the @kbd{CONTROL} key, next
pressing the @kbd{d} key, and finally releasing both keys.

@node Sample Data Files, , Conventions, What Is Awk
@section Data Files for the Examples

@cindex input file, sample
@cindex sample input file
@cindex @file{BBS-list} file
Many of the examples in this @value{DOCUMENT} take their input from two sample
data files. The first, called @file{BBS-list}, represents a list of
computer bulletin board systems together with information about those systems.
The second data file, called @file{inventory-shipped}, contains
information about shipments on a monthly basis. In both files,
each line is considered to be one @dfn{record}.

In the file @file{BBS-list}, each record contains the name of a computer
bulletin board, its phone number, the board's baud rate(s), and a code for
the number of hours it is operational. An @samp{A} in the last column
means the board operates 24 hours a day. A @samp{B} in the last
column means the board operates evening and weekend hours, only. A
@samp{C} means the board operates only on weekends.

@c 2e: Update the baud rates to reflect today's faster modems
@example
@c system mkdir eg
@c system mkdir eg/lib
@c system mkdir eg/data
@c system mkdir eg/prog
@c system mkdir eg/misc
@c file eg/data/BBS-list
aardvark     555-5553     1200/300          B
alpo-net     555-3412     2400/1200/300     A
barfly       555-7685     1200/300          A
bites        555-1675     2400/1200/300     A
camelot      555-0542     300               C
core         555-2912     1200/300          C
fooey        555-1234     2400/1200/300     B
foot         555-6699     1200/300          B
macfoo       555-6480     1200/300          A
sdace        555-3430     2400/1200/300     A
sabafoo      555-2127     1200/300          C
@c endfile
@end example

@cindex @file{inventory-shipped} file
The second data file, called @file{inventory-shipped}, represents
information about shipments during the year.
Each record contains the month of the year, the number
of green crates shipped, the number of red boxes shipped, the number of
orange bags shipped, and the number of blue packages shipped,
respectively. There are 16 entries, covering the 12 months of one year
and four months of the next year.

@example
@c file eg/data/inventory-shipped
Jan  13  25  15 115
Feb  15  32  24 226
Mar  15  24  34 228
Apr  31  52  63 420
May  16  34  29 208
Jun  31  42  75 492
Jul  24  34  67 436
Aug  15  34  47 316
Sep  13  55  37 277
Oct  29  54  68 525
Nov  20  87  82 577
Dec  17  35  61 401

Jan  21  36  64 620
Feb  26  58  80 652
Mar  24  75  70 495
Apr  21  70  74 514
@c endfile
@end example

@ifinfo
If you are reading this in GNU Emacs using Info, you can copy the regions
of text showing these sample files into your own test files. This way you
can try out the examples shown in the remainder of this document.
You do
this by using the command @kbd{M-x write-region} to copy text from the Info
file into a file for use with @code{awk}
(@xref{Misc File Ops, , Miscellaneous File Operations, emacs, GNU Emacs Manual},
for more information). Using this information, create your own
@file{BBS-list} and @file{inventory-shipped} files, and practice what you
learn in this @value{DOCUMENT}.

If you are using the stand-alone version of Info,
see @ref{Extract Program, ,Extracting Programs from Texinfo Source Files},
for an @code{awk} program that will extract these data files from
@file{gawk.texi}, the Texinfo source file for this Info file.
@end ifinfo

@node Getting Started, One-liners, What Is Awk, Top
@chapter Getting Started with @code{awk}
@cindex script, definition of
@cindex rule, definition of
@cindex program, definition of
@cindex basic function of @code{awk}

The basic function of @code{awk} is to search files for lines (or other
units of text) that contain certain patterns. When a line matches one
of the patterns, @code{awk} performs specified actions on that line.
@code{awk} keeps processing input lines in this way until the end of the
input files is reached.

@cindex data-driven languages
@cindex procedural languages
@cindex language, data-driven
@cindex language, procedural
Programs in @code{awk} are different from programs in most other languages,
because @code{awk} programs are @dfn{data-driven}; that is, you describe
the data you wish to work with, and then what to do when you find it.
Most other languages are @dfn{procedural}; you have to describe, in great
detail, every step the program is to take. When working with procedural
languages, it is usually much
harder to clearly describe the data your program will process.
For this reason, @code{awk} programs are often refreshingly easy to both
write and read.
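
As a minimal sketch of what ``data-driven'' means in practice, the
following shell fragment runs a one-rule @code{awk} program over three
records laid out like those in @file{BBS-list}. The temporary file and the
shell variable names are assumptions made only for this illustration, and
@samp{$1} (the first field of a record) is a feature covered later.

```shell
# Build a three-line sample in the layout of the BBS-list file.
# The temporary file is only an assumption of this sketch.
datafile=$(mktemp)
cat > "$datafile" <<'EOF'
aardvark     555-5553     1200/300          B
alpo-net     555-3412     2400/1200/300     A
camelot      555-0542     300               C
EOF

# The pattern /2400/ describes the data of interest; the action in
# braces says what to do with each line that matches.  awk itself
# supplies the loop that reads the input lines.
fast_boards=$(awk '/2400/ { print $1 }' "$datafile")
echo "$fast_boards"

rm -f "$datafile"
```

Nothing in the program opens the file, reads lines, or tests for
end-of-file; a procedural program would have to spell out each of
those steps.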

@cindex program, definition of
@cindex rule, definition of
When you run @code{awk}, you specify an @code{awk} @dfn{program} that
tells @code{awk} what to do. The program consists of a series of
@dfn{rules}. (It may also contain @dfn{function definitions},
an advanced feature which we will ignore for now.
@xref{User-defined, ,User-defined Functions}.) Each rule specifies one
pattern to search for, and one action to perform when that pattern is found.

Syntactically, a rule consists of a pattern followed by an action. The
action is enclosed in curly braces to separate it from the pattern.
Rules are usually separated by newlines. Therefore, an @code{awk}
program looks like this:

@example
@var{pattern} @{ @var{action} @}
@var{pattern} @{ @var{action} @}
@dots{}
@end example

@menu
* Names::                       What name to use to find @code{awk}.
* Running gawk::                How to run @code{gawk} programs; includes
                                command line syntax.
* Very Simple::                 A very simple example.
* Two Rules::                   A less simple one-line example with two rules.
* More Complex::                A more complex example.
* Statements/Lines::            Subdividing or combining statements into
                                lines.
* Other Features::              Other Features of @code{awk}.
* When::                        When to use @code{gawk} and when to use other
                                things.
@end menu

@node Names, Running gawk , Getting Started, Getting Started
@section A Rose By Any Other Name

@cindex old @code{awk} vs. new @code{awk}
@cindex new @code{awk} vs. old @code{awk}
The @code{awk} language has evolved over the years. Full details are
provided in @ref{Language History, ,The Evolution of the @code{awk} Language}.
The language described in this @value{DOCUMENT}
is often referred to as ``new @code{awk}.''

Because of this, many systems have multiple
versions of @code{awk}.
Some systems have an @code{awk} utility that implements the
original version of the @code{awk} language, and a @code{nawk} utility
for the new version. Others have an @code{oawk} for the ``old @code{awk}''
language, and plain @code{awk} for the new one. Still others only
have one version, usually the new one.@footnote{Often, these systems
use @code{gawk} for their @code{awk} implementation!}

All in all, this makes it difficult for you to know which version of
@code{awk} you should run when writing your programs. The best advice
we can give here is to check your local documentation. Look for @code{awk},
@code{oawk}, and @code{nawk}, as well as for @code{gawk}. Chances are, you
will have some version of new @code{awk} on your system, and that is what
you should use when running your programs. (Of course, if you're reading
this @value{DOCUMENT}, chances are good that you have @code{gawk}!)

Throughout this @value{DOCUMENT}, whenever we refer to a language feature
that should be available in any complete implementation of POSIX @code{awk},
we simply use the term @code{awk}. When referring to a feature that is
specific to the GNU implementation, we use the term @code{gawk}.

@node Running gawk, Very Simple, Names, Getting Started
@section How to Run @code{awk} Programs

@cindex command line formats
@cindex running @code{awk} programs
There are several ways to run an @code{awk} program. If the program is
short, it is easiest to include it in the command that runs @code{awk},
like this:

@example
awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
@end example

@noindent
where @var{program} consists of a series of patterns and actions, as
described earlier.
(The reason for the single quotes is described below, in
@ref{One-shot, ,One-shot Throw-away @code{awk} Programs}.)

When the program is long, it is usually more convenient to put it in a file
and run it with a command like this:

@example
awk -f @var{program-file} @var{input-file1} @var{input-file2} @dots{}
@end example

@menu
* One-shot::                    Running a short throw-away @code{awk} program.
* Read Terminal::               Using no input files (input from terminal
                                instead).
* Long::                        Putting permanent @code{awk} programs in
                                files.
* Executable Scripts::          Making self-contained @code{awk} programs.
* Comments::                    Adding documentation to @code{gawk} programs.
@end menu

@node One-shot, Read Terminal, Running gawk, Running gawk
@subsection One-shot Throw-away @code{awk} Programs

Once you are familiar with @code{awk}, you will often type in simple
programs the moment you want to use them. Then you can write the
program as the first argument of the @code{awk} command, like this:

@example
awk '@var{program}' @var{input-file1} @var{input-file2} @dots{}
@end example

@noindent
where @var{program} consists of a series of @var{patterns} and
@var{actions}, as described earlier.

@cindex single quotes, why needed
This command format instructs the @dfn{shell}, or command interpreter,
to start @code{awk} and use the @var{program} to process records in the
input file(s). There are single quotes around @var{program} so that
the shell doesn't interpret any @code{awk} characters as special shell
characters. They also cause the shell to treat all of @var{program} as
a single argument for @code{awk} and allow @var{program} to be more
than one line long.

This format is also useful for running short or medium-sized @code{awk}
programs from shell scripts, because it avoids the need for a separate
file for the @code{awk} program. A self-contained shell script is more
reliable since there are no other files to misplace.
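
One way such a self-contained shell script might look is sketched below.
The function name @code{count_lines} and the variable @code{line_total}
are hypothetical, chosen only for this illustration; the @code{awk}
program uses @samp{END} and a counter, features that are explained later.

```shell
# count_lines is a hypothetical helper, not part of awk or gawk.
# The awk program sits inside single quotes, so the shell passes it
# to awk unchanged, as one argument, even though it spans two lines.
count_lines() {
    awk '{ n++ }
         END { print n + 0 }' "$@"
}

# Example use: count the lines arriving on standard input.
line_total=$(printf 'one\ntwo\nthree\n' | count_lines)
echo "$line_total"
```

Because the program text lives inside the script, there is no second
file to install or keep in step with it. The @code{"$@@"} passes any
file names given to the function straight through to @code{awk}; with
none given, @code{awk} reads its standard input.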

@ref{One-liners, , Useful One Line Programs}, presents several short,
self-contained programs.

As an interesting side point, the command

@example
awk '/foo/' @var{files} @dots{}
@end example

@noindent
is essentially the same as

@cindex @code{egrep}
@example
egrep foo @var{files} @dots{}
@end example

@node Read Terminal, Long, One-shot, Running gawk
@subsection Running @code{awk} without Input Files

@cindex standard input
@cindex input, standard
You can also run @code{awk} without any input files. If you type the
command line:

@example
awk '@var{program}'
@end example

@noindent
then @code{awk} applies the @var{program} to the @dfn{standard input},
which usually means whatever you type on the terminal. This continues
until you indicate end-of-file by typing @kbd{Control-d}.
(On other operating systems, the end-of-file character may be different.
For example, on OS/2 and MS-DOS, it is @kbd{Control-z}.)

For example, the following program prints a friendly piece of advice
(from Douglas Adams' @cite{The Hitchhiker's Guide to the Galaxy}),
to keep you from worrying about the complexities of computer programming
(@samp{BEGIN} is a feature we haven't discussed yet).

@example
$ awk "BEGIN @{ print \"Don't Panic!\" @}"
@print{} Don't Panic!
@end example

@cindex quoting, shell
@cindex shell quoting
This program does not read any input. The @samp{\} before each of the
inner double quotes is necessary because of the shell's quoting rules,
in particular because it mixes both single quotes and double quotes.

This next simple @code{awk} program
emulates the @code{cat} utility; it copies whatever you type at the
keyboard to its standard output. (Why this works is explained shortly.)

@example
$ awk '@{ print @}'
Now is the time for all good men
@print{} Now is the time for all good men
to come to the aid of their country.
@print{} to come to the aid of their country.
Four score and seven years ago, ...
@print{} Four score and seven years ago, ...
What, me worry?
@print{} What, me worry?
@kbd{Control-d}
@end example

@node Long, Executable Scripts, Read Terminal, Running gawk
@subsection Running Long Programs

@cindex running long programs
@cindex @code{-f} option
@cindex program file
@cindex file, @code{awk} program
Sometimes your @code{awk} programs can be very long. In this case it is
more convenient to put the program into a separate file. To tell
@code{awk} to use that file for its program, you type:

@example
awk -f @var{source-file} @var{input-file1} @var{input-file2} @dots{}
@end example

The @samp{-f} instructs the @code{awk} utility to get the @code{awk} program
from the file @var{source-file}. Any file name can be used for
@var{source-file}. For example, you could put the program:

@example
BEGIN @{ print "Don't Panic!" @}
@end example

@noindent
into the file @file{advice}. Then this command:

@example
awk -f advice
@end example

@noindent
does the same thing as this one:

@example
awk "BEGIN @{ print \"Don't Panic!\" @}"
@end example

@cindex quoting, shell
@cindex shell quoting
@noindent
which was explained earlier (@pxref{Read Terminal, ,Running @code{awk} without Input Files}).
Note that you don't usually need single quotes around the file name that you
specify with @samp{-f}, because most file names don't contain any of the shell's
special characters. Notice that in @file{advice}, the @code{awk}
program did not have single quotes around it.
The quotes are only needed
for programs that are provided on the @code{awk} command line.

If you want to identify your @code{awk} program files clearly as such,
you can add the extension @file{.awk} to the file name. This doesn't
affect the execution of the @code{awk} program, but it does make
``housekeeping'' easier.

@node Executable Scripts, Comments, Long, Running gawk
@subsection Executable @code{awk} Programs
@cindex executable scripts
@cindex scripts, executable
@cindex self contained programs
@cindex program, self contained
@cindex @code{#!} (executable scripts)

Once you have learned @code{awk}, you may want to write self-contained
@code{awk} scripts, using the @samp{#!} script mechanism. You can do
this on many Unix systems@footnote{The @samp{#!} mechanism works on
Linux systems,
Unix systems derived from Berkeley Unix, System V Release 4, and some System
V Release 3 systems.} (and someday on the GNU system).

For example, you could update the file @file{advice} to look like this:

@example
#! /bin/awk -f

BEGIN @{ print "Don't Panic!" @}
@end example

@noindent
After making this file executable (with the @code{chmod} utility), you
can simply type @samp{advice}
at the shell, and the system will arrange to run @code{awk}@footnote{The
line beginning with @samp{#!} lists the full file name of an interpreter
to be run, and an optional initial command line argument to pass to that
interpreter. The operating system then runs the interpreter with the given
argument and the full argument list of the executed program. The first argument
in the list is the full file name of the @code{awk} program. The rest of the
argument list will either be options to @code{awk}, or data files,
or both.} as if you had typed @samp{awk -f advice}.

@example
@group
$ advice
@print{} Don't Panic!
@end group
@end example

@noindent
Self-contained @code{awk} scripts are useful when you want to write a
program which users can invoke without their having to know that the program is
written in @code{awk}.

@strong{Caution:} You should not put more than one argument on the @samp{#!}
line after the path to @code{awk}. This will not work. The operating system
treats the rest of the line as a single argument, and passes it to @code{awk}.
Doing this will lead to confusing behavior: most likely a usage diagnostic
of some sort from @code{awk}.

@cindex shell scripts
@cindex scripts, shell
Some older systems do not support the @samp{#!} mechanism. You can get a
similar effect using a regular shell script. It would look something
like this:

@example
: The colon ensures execution by the standard shell.
awk '@var{program}' "$@@"
@end example

Using this technique, it is @emph{vital} to enclose the @var{program} in
single quotes to protect it from interpretation by the shell. If you
omit the quotes, only a shell wizard can predict the results.

The @code{"$@@"} causes the shell to forward all the command line
arguments to the @code{awk} program, without interpretation. The first
line, which starts with a colon, is used so that this shell script will
work even if invoked by a user who uses the C shell. (Not all older systems
obey this convention, but many do.)
@c 2e:
@c Someday: (See @cite{The Bourne Again Shell}, by ??.)

@node Comments, , Executable Scripts, Running gawk
@subsection Comments in @code{awk} Programs
@cindex @code{#} (comment)
@cindex comments
@cindex use of comments
@cindex documenting @code{awk} programs
@cindex programs, documenting

A @dfn{comment} is some text that is included in a program for the sake
of human readers; it is not really part of the program.
Comments
can explain what the program does, and how it works. Nearly all
programming languages have provisions for comments, because programs are
typically hard to understand without their extra help.

In the @code{awk} language, a comment starts with the sharp sign
character, @samp{#}, and continues to the end of the line.
The @samp{#} does not have to be the first character on the line. The
@code{awk} language ignores the rest of a line following a sharp sign.
For example, we could have put the following into @file{advice}:

@example
# This program prints a nice friendly message. It helps
# keep novice users from being afraid of the computer.
BEGIN @{ print "Don't Panic!" @}
@end example

You can put comment lines into keyboard-composed throw-away @code{awk}
programs also, but this usually isn't very useful; the purpose of a
comment is to help you or another person understand the program at
a later time.

@strong{Caution:} As mentioned in
@ref{One-shot, ,One-shot Throw-away @code{awk} Programs},
you can enclose small to medium programs in single quotes, in order to keep
your shell scripts self-contained. When doing so, @emph{don't} put
an apostrophe (i.e., a single quote) into a comment (or anywhere else
in your program). The shell will interpret the quote as the closing
quote for the entire program. As a result, usually the shell will
print a message about mismatched quotes, and if @code{awk} actually
runs, it will probably print strange messages about syntax errors.
For example:

@example
awk 'BEGIN @{ print "hello" @} # let's be cute'
@end example

@node Very Simple, Two Rules, Running gawk, Getting Started
@section A Very Simple Example

The following command runs a simple @code{awk} program that searches the
input file @file{BBS-list} for the string of characters: @samp{foo}.
(A
string of characters is usually called a @dfn{string}.
The term @dfn{string} is perhaps based on similar usage in English, such
as ``a string of pearls,'' or, ``a string of cars in a train.'')

@example
awk '/foo/ @{ print $0 @}' BBS-list
@end example

@noindent
When lines containing @samp{foo} are found, they are printed, because
@w{@samp{print $0}} means print the current line. (Just @samp{print} by
itself means the same thing, so we could have written that
instead.)

You will notice that slashes, @samp{/}, surround the string @samp{foo}
in the @code{awk} program. The slashes indicate that @samp{foo}
is a pattern to search for. This type of pattern is called a
@dfn{regular expression}, and is covered in more detail later
(@pxref{Regexp, ,Regular Expressions}).
The pattern is allowed to match parts of words.
There are
single-quotes around the @code{awk} program so that the shell won't
interpret any of it as special shell characters.

Here is what this program prints:

@example
@group
$ awk '/foo/ @{ print $0 @}' BBS-list
@print{} fooey        555-1234     2400/1200/300     B
@print{} foot         555-6699     1200/300          B
@print{} macfoo       555-6480     1200/300          A
@print{} sabafoo      555-2127     1200/300          C
@end group
@end example

@cindex action, default
@cindex pattern, default
@cindex default action
@cindex default pattern
In an @code{awk} rule, either the pattern or the action can be omitted,
but not both. If the pattern is omitted, then the action is performed
for @emph{every} input line. If the action is omitted, the default
action is to print all lines that match the pattern.
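
These defaults can be checked with a small experiment. The sketch below
feeds three abbreviated records (names and phone numbers taken from
@file{BBS-list}) to the same @samp{/foo/} rule written in full, with the
action omitted, and with the pattern omitted. The shell variable names
are assumptions made only for this illustration.

```shell
# Three abbreviated records; the full BBS-list file has more columns.
sample='fooey 555-1234
barfly 555-7685
macfoo 555-6480'

# Pattern and action both present.
with_both=$(printf '%s\n' "$sample" | awk '/foo/ { print $0 }')

# Action omitted: lines that match the pattern are printed by default.
pattern_only=$(printf '%s\n' "$sample" | awk '/foo/')

# Pattern omitted: the action is performed for every input line.
action_only=$(printf '%s\n' "$sample" | awk '{ print $0 }')

echo "$pattern_only"
```

Here @code{with_both} and @code{pattern_only} hold the same two lines
(the @samp{fooey} and @samp{macfoo} records), while @code{action_only}
holds all three.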

@cindex empty action
@cindex action, empty
Thus, we could leave out the action (the @code{print} statement and the curly
braces) in the above example, and the result would be the same: all
lines matching the pattern @samp{foo} would be printed. By comparison,
omitting the @code{print} statement but retaining the curly braces makes an
empty action that does nothing; then no lines would be printed.

@node Two Rules, More Complex, Very Simple, Getting Started
@section An Example with Two Rules
@cindex how @code{awk} works

The @code{awk} utility reads the input files one line at a
time. For each line, @code{awk} tries the patterns of each of the rules.
If several patterns match then several actions are run, in the order in
which they appear in the @code{awk} program. If no patterns match, then
no actions are run.

After processing all the rules (perhaps none) that match the line,
@code{awk} reads the next line (however,
@pxref{Next Statement, ,The @code{next} Statement},
and also @pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
This continues until the end of the file is reached.

For example, the @code{awk} program:

@example
/12/ @{ print $0 @}
/21/ @{ print $0 @}
@end example

@noindent
contains two rules. The first rule has the string @samp{12} as the
pattern and @samp{print $0} as the action. The second rule has the
string @samp{21} as the pattern and also has @samp{print $0} as the
action. Each rule's action is enclosed in its own pair of braces.

This @code{awk} program prints every line that contains the string
@samp{12} @emph{or} the string @samp{21}. If a line contains both
strings, it is printed twice, once by each rule.
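
The double printing can be seen in a minimal sketch like the following,
which assumes a POSIX-compliant shell. The four input lines, the
@samp{rule 1:} and @samp{rule 2:} labels, and the variable @code{twice} are
invented for this illustration so that each output line shows which rule
produced it.

```shell
# Made-up input; only the line "b 12 21" contains both strings.
twice=$(printf 'a 12\nb 12 21\nc 21\nd 33\n' |
awk '/12/ { print "rule 1: " $0 }
     /21/ { print "rule 2: " $0 }')
printf '%s\n' "$twice"
```

The line @samp{b 12 21} appears twice in the output, once per rule, and the
line @samp{d 33} does not appear at all because no pattern matches it.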

This is what happens if we run this program on our two sample data files,
@file{BBS-list} and @file{inventory-shipped}, as shown here:

@example
$ awk '/12/ @{ print $0 @}
>      /21/ @{ print $0 @}' BBS-list inventory-shipped
@print{} aardvark     555-5553     1200/300          B
@print{} alpo-net     555-3412     2400/1200/300     A
@print{} barfly       555-7685     1200/300          A
@print{} bites        555-1675     2400/1200/300     A
@print{} core         555-2912     1200/300          C
@print{} fooey        555-1234     2400/1200/300     B
@print{} foot         555-6699     1200/300          B
@print{} macfoo       555-6480     1200/300          A
@print{} sdace        555-3430     2400/1200/300     A
@print{} sabafoo      555-2127     1200/300          C
@print{} sabafoo      555-2127     1200/300          C
@print{} Jan  21  36  64 620
@print{} Apr  21  70  74 514
@end example

@noindent
Note how the line in @file{BBS-list} beginning with @samp{sabafoo}
was printed twice, once for each rule.

@node More Complex, Statements/Lines, Two Rules, Getting Started
@section A More Complex Example

@ignore
We have to use ls -lg here to get portable output across Unix systems.
The POSIX ls matches this behavior too. Sigh.
@end ignore
Here is an example to give you an idea of what typical @code{awk}
programs do. This example shows how @code{awk} can be used to
summarize, select, and rearrange the output of another utility. It uses
features that haven't been covered yet, so don't worry if you don't
understand all the details.

@example
ls -lg | awk '$6 == "Nov" @{ sum += $5 @}
              END @{ print sum @}'
@end example

@cindex @code{csh}, backslash continuation
@cindex backslash continuation in @code{csh}
This command prints the total number of bytes in all the files in the
current directory that were last modified in November (of any year).
(In the C shell you would need to type a semicolon and then a backslash
at the end of the first line; in a POSIX-compliant shell, such as the
Bourne shell or Bash, the GNU Bourne-Again shell, you can type the example
as shown.)
@ignore
FIXME: how can users tell what shell they are running? Need a footnote
or something, but getting into this is a distraction.
@end ignore

The @w{@samp{ls -lg}} part of this example is a system command that gives
you a listing of the files in a directory, including file size and the date
the file was last modified. Its output looks like this:

@example
-rw-r--r--  1 arnold   user   1933 Nov  7 13:05 Makefile
-rw-r--r--  1 arnold   user  10809 Nov  7 13:03 gawk.h
-rw-r--r--  1 arnold   user    983 Apr 13 12:14 gawk.tab.h
-rw-r--r--  1 arnold   user  31869 Jun 15 12:20 gawk.y
-rw-r--r--  1 arnold   user  22414 Nov  7 13:03 gawk1.c
-rw-r--r--  1 arnold   user  37455 Nov  7 13:03 gawk2.c
-rw-r--r--  1 arnold   user  27511 Dec  9 13:07 gawk3.c
-rw-r--r--  1 arnold   user   7989 Nov  7 13:03 gawk4.c
@end example

@noindent
The first field contains read-write permissions, the second field contains
the number of links to the file, and the third field identifies the owner of
the file. The fourth field identifies the group of the file.
The fifth field contains the size of the file in bytes. The
sixth, seventh and eighth fields contain the month, day, and time,
respectively, that the file was last modified. Finally, the ninth field
contains the name of the file.

@cindex automatic initialization
@cindex initialization, automatic
The @samp{$6 == "Nov"} in our @code{awk} program is an expression that
tests whether the sixth field of the output from @w{@samp{ls -lg}}
matches the string @samp{Nov}. Each time a line has the string
@samp{Nov} for its sixth field, the action @samp{sum += $5} is
performed.
This adds the fifth field (the file size) to the variable
@code{sum}. As a result, when @code{awk} has finished reading all the
input lines, @code{sum} is the sum of the sizes of files whose
lines matched the pattern. (This works because @code{awk} variables
are automatically initialized to zero.)

After the last line of output from @code{ls} has been processed, the
@code{END} rule is executed, and the value of @code{sum} is
printed. In this example, the value of @code{sum} would be 80600.

These more advanced @code{awk} techniques are covered in later sections
(@pxref{Action Overview, ,Overview of Actions}). Before you can move on to more
advanced @code{awk} programming, you have to know how @code{awk} interprets
your input and displays your output. By manipulating fields and using
@code{print} statements, you can produce some very useful and impressive
looking reports.

@node Statements/Lines, Other Features, More Complex, Getting Started
@section @code{awk} Statements Versus Lines
@cindex line break
@cindex newline

Most often, each line in an @code{awk} program is a separate statement or
separate rule, like this:

@example
awk '/12/ @{ print $0 @}
     /21/ @{ print $0 @}' BBS-list inventory-shipped
@end example

However, @code{gawk} will ignore newlines after any of the following:

@example
,    @{    ?    :    ||    &&    do    else
@end example

@noindent
A newline at any other point is considered the end of the statement.
(Splitting lines after @samp{?} and @samp{:} is a minor @code{gawk}
extension. The @samp{?} and @samp{:} referred to here are the
operators of the three-operand conditional expression described in
@ref{Conditional Exp, ,Conditional Expressions}.)
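
The newline rule can be tried out with a minimal sketch such as the one
below, assuming a POSIX-compliant shell. The three input numbers and the
variable @code{in_range} are made up for this illustration. The lines are
broken only after @samp{&&}, @samp{@{}, and @samp{,}; breaks after @samp{?}
and @samp{:} are left out because they are a @code{gawk} extension.

```shell
# Made-up input: three numbers, one per line.
# awk ignores the newlines after "&&", "{" and ",", so the whole
# program below is a single rule with a single print statement.
in_range=$(printf '5\n15\n25\n' |
awk '$1 > 10 &&
     $1 < 20 {
         print "in range:",
               $1
     }')
printf '%s\n' "$in_range"    # prints: in range: 15
```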

@cindex backslash continuation
@cindex continuation of lines
@cindex line continuation
If you would like to split a single statement into two lines at a point
where a newline would terminate it, you can @dfn{continue} it by ending the
first line with a backslash character, @samp{\}. The backslash must be
the final character on the line to be recognized as a continuation
character. This is allowed absolutely anywhere in the statement, even
in the middle of a string or regular expression. For example:

@example
awk '/This regular expression is too long, so continue it\
 on the next line/ @{ print $1 @}'
@end example

@noindent
@cindex portability issues
We have generally not used backslash continuation in the sample programs
in this @value{DOCUMENT}. Since in @code{gawk} there is no limit on the
length of a line, it is never strictly necessary; it just makes programs
more readable. For this same reason, as well as for clarity, we have
kept most statements short in the sample programs presented throughout
the @value{DOCUMENT}. Backslash continuation is most useful when your
@code{awk} program is in a separate source file, instead of typed in on
the command line. You should also note that many @code{awk}
implementations are more particular about where you may use backslash
continuation. For example, they may not allow you to split a string
constant using backslash continuation. Thus, for maximal portability of
your @code{awk} programs, it is best not to split your lines in the
middle of a regular expression or a string.
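
Here is a minimal sketch of backslash continuation used at a portable
point, between two tokens rather than inside a string or regexp. It
assumes a POSIX-compliant shell; the variable name @code{total} and the
arithmetic are made up for this illustration.

```shell
# The backslash joins the two lines into the single statement
# "total = 1 + 2 + 3"; a bare newline would end it after "1 + 2".
total=$(awk 'BEGIN {
    total = 1 + 2 \
          + 3
    print total
}')
printf '%s\n' "$total"    # prints: 6
```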

@cindex @code{csh}, backslash continuation
@cindex backslash continuation in @code{csh}
@strong{Caution: backslash continuation does not work as described above
with the C shell.} Continuation with backslash works for @code{awk}
programs in files, and also for one-shot programs @emph{provided} you
are using a POSIX-compliant shell, such as the Bourne shell or Bash, the
GNU Bourne-Again shell. But the C shell (@code{csh}) behaves
differently! There, you must use two backslashes in a row, followed by
a newline. Note also that when using the C shell, @emph{every} newline
in your @code{awk} program must be escaped with a backslash. To illustrate:

@example
% awk 'BEGIN @{ \
?   print \\
?       "hello, world" \
? @}'
@print{} hello, world
@end example

@noindent
Here, the @samp{%} and @samp{?} are the C shell's primary and secondary
prompts, analogous to the standard shell's @samp{$} and @samp{>}.

@code{awk} is a line-oriented language. Each rule's action has to
begin on the same line as the pattern. To have the pattern and action
on separate lines, you @emph{must} use backslash continuation---there
is no other way.

@cindex backslash continuation and comments
@cindex comments and backslash continuation
Note that backslash continuation and comments do not mix. As soon
as @code{awk} sees the @samp{#} that starts a comment, it ignores
@emph{everything} on the rest of the line. For example:

@example
@group
$ gawk 'BEGIN @{ print "dont panic" # a friendly \
>                                    BEGIN rule
> @}'
@error{} gawk: cmd. line:2:                BEGIN rule
@error{} gawk: cmd. line:2:                ^ parse error
@end group
@end example

@noindent
Here, it looks like the backslash would continue the comment onto the
next line. However, the backslash-newline combination is never even
noticed, since it is ``hidden'' inside the comment.
Thus, the
@samp{BEGIN} is noted as a syntax error.

@cindex multiple statements on one line
When @code{awk} statements within one rule are short, you might want to put
more than one of them on a line. You do this by separating the statements
with a semicolon, @samp{;}.

This also applies to the rules themselves.
Thus, the previous program could have been written:

@example
/12/ @{ print $0 @} ; /21/ @{ print $0 @}
@end example

@noindent
@strong{Note:} the requirement that rules on the same line must be
separated with a semicolon was not in the original @code{awk}
language; it was added for consistency with the treatment of statements
within an action.

@node Other Features, When, Statements/Lines, Getting Started
@section Other Features of @code{awk}

The @code{awk} language provides a number of predefined, or built-in, variables, which
your programs can use to get information from @code{awk}. There are other
variables your program can set to control how @code{awk} processes your
data.

In addition, @code{awk} provides a number of built-in functions for doing
common computational and string related operations.

As we develop our presentation of the @code{awk} language, we introduce
most of the variables and many of the functions. They are defined
systematically in @ref{Built-in Variables}, and
@ref{Built-in, ,Built-in Functions}.

@node When, , Other Features, Getting Started
@section When to Use @code{awk}

@cindex when to use @code{awk}
@cindex applications of @code{awk}
You might wonder how @code{awk} might be useful for you. Using
utility programs, advanced patterns, field separators, arithmetic
statements, and other selection criteria, you can produce much more
complex output.
The @code{awk} language is very useful for producing
reports from large amounts of raw data, such as summarizing information
from the output of other utility programs like @code{ls}.
(@xref{More Complex, ,A More Complex Example}.)

Programs written with @code{awk} are usually much smaller than they would
be in other languages. This makes @code{awk} programs easy to compose and
use. Often, @code{awk} programs can be quickly composed at your terminal,
used once, and thrown away. Since @code{awk} programs are interpreted, you
can avoid the (usually lengthy) compilation part of the typical
edit-compile-test-debug cycle of software development.

Complex programs have been written in @code{awk}, including a complete
retargetable assembler for eight-bit microprocessors (@pxref{Glossary}, for
more information) and a microcode assembler for a special purpose Prolog
computer. However, @code{awk}'s capabilities are strained by tasks of
such complexity.

If you find yourself writing @code{awk} scripts of more than, say, a few
hundred lines, you might consider using a different programming
language. Emacs Lisp is a good choice if you need sophisticated string
or pattern matching capabilities. The shell is also good at string and
pattern matching; in addition, it allows powerful use of the system
utilities. More conventional languages, such as C, C++, and Lisp, offer
better facilities for system programming and for managing the complexity
of large programs. Programs in these languages may require more lines
of source code than the equivalent @code{awk} programs, but they are
easier to maintain and usually run more efficiently.

@node One-liners, Regexp, Getting Started, Top
@chapter Useful One Line Programs

@cindex one-liners
Many useful @code{awk} programs are short, just a line or two.
Here is a
collection of useful, short programs to get you started. Some of these
programs contain constructs that haven't been covered yet. The description
of the program will give you a good idea of what is going on, but please
read the rest of the @value{DOCUMENT} to become an @code{awk} expert!

Most of the examples use a data file named @file{data}. This is just a
placeholder; if you were to use these programs yourself, you would substitute
your own file names for @file{data}.

@ifinfo
Since you are reading this in Info, each line of the example code is
enclosed in quotes, to represent text that you would type literally.
The examples themselves represent shell commands that use single quotes
to keep the shell from interpreting the contents of the program.
When reading the examples, focus on the text between the open and close
quotes.
@end ifinfo

@table @code
@item awk '@{ if (length($0) > max) max = length($0) @}
@itemx @ @ @ @ @ END @{ print max @}' data
This program prints the length of the longest input line.

@item awk 'length($0) > 80' data
This program prints every line that is longer than 80 characters. The sole
rule has a relational expression as its pattern, and has no action (so the
default action, printing the record, is used).

@item expand@ data@ |@ awk@ '@{ if (x < length()) x = length() @}
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ END @{ print "maximum line length is " x @}'
This program prints the length of the longest line in @file{data}. The input
is processed by the @code{expand} program to change tabs into spaces,
so the widths compared are actually the right-margin columns.

@item awk 'NF > 0' data
This program prints every line that has at least one field.
This is an
easy way to delete blank lines from a file (or rather, to create a new
file similar to the old file but from which the blank lines have been
deleted).

@c Karl Berry points out that new users probably don't want to see
@c multiple ways to do things, just the `best' way. He's probably
@c right. At some point it might be worth adding something about there
@c often being multiple ways to do things in awk, but for now we'll
@c just take this one out.
@ignore
@item awk '@{ if (NF > 0) print @}' data
This program also prints every line that has at least one field. Here we
allow the rule to match every line, and then decide in the action whether
to print.
@end ignore

@item awk@ 'BEGIN@ @{@ for (i = 1; i <= 7; i++)
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ print int(101 * rand()) @}'
This program prints seven random numbers from zero to 100, inclusive.

@item ls -lg @var{files} | awk '@{ x += $5 @} ; END @{ print "total bytes: " x @}'
This program prints the total number of bytes used by @var{files}.

@item ls -lg @var{files} | awk '@{ x += $5 @}
@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ END @{ print "total K-bytes: " (x + 1023)/1024 @}'
This program prints the total number of kilobytes used by @var{files}.

@item awk -F: '@{ print $1 @}' /etc/passwd | sort
This program prints a sorted list of the login names of all users.

@item awk 'END @{ print NR @}' data
This program counts lines in a file.

@item awk 'NR % 2 == 0' data
This program prints the even numbered lines in the data file.
If you were to use the expression @samp{NR % 2 == 1} instead,
it would print the odd numbered lines.
@end table

@node Regexp, Reading Files, One-liners, Top
@chapter Regular Expressions
@cindex pattern, regular expressions
@cindex regexp
@cindex regular expression
@cindex regular expressions as patterns

A @dfn{regular expression}, or @dfn{regexp}, is a way of describing a
set of strings.
Because regular expressions are such a fundamental part of @code{awk}
programming, their format and use deserve a separate chapter.

A regular expression enclosed in slashes (@samp{/})
is an @code{awk} pattern that matches every input record whose text
belongs to that set.

The simplest regular expression is a sequence of letters, numbers, or
both. Such a regexp matches any string that contains that sequence.
Thus, the regexp @samp{foo} matches any string containing @samp{foo}.
Therefore, the pattern @code{/foo/} matches any input record containing
the three characters @samp{foo}, @emph{anywhere} in the record. Other
kinds of regexps let you specify more complicated classes of strings.

@iftex
Initially, the examples will be simple. As we explain more about how
regular expressions work, we will present more complicated examples.
@end iftex

@menu
* Regexp Usage::                How to Use Regular Expressions.
* Escape Sequences::            How to write non-printing characters.
* Regexp Operators::            Regular Expression Operators.
* GNU Regexp Operators::        Operators specific to GNU software.
* Case-sensitivity::            How to do case-insensitive matching.
* Leftmost Longest::            How much text matches.
* Computed Regexps::            Using Dynamic Regexps.
@end menu

@node Regexp Usage, Escape Sequences, Regexp, Regexp
@section How to Use Regular Expressions

A regular expression can be used as a pattern by enclosing it in
slashes. Then the regular expression is tested against the
entire text of each record.
(Normally, it only needs
to match some part of the text in order to succeed.) For example, this
prints the second field of each record that contains the three
characters @samp{foo} anywhere in it:

@example
@group
$ awk '/foo/ @{ print $2 @}' BBS-list
@print{} 555-1234
@print{} 555-6699
@print{} 555-6480
@print{} 555-2127
@end group
@end example

@cindex regexp matching operators
@cindex string-matching operators
@cindex operators, string-matching
@cindex operators, regexp matching
@cindex regexp match/non-match operators
@cindex @code{~} operator
@cindex @code{!~} operator
Regular expressions can also be used in matching expressions. These
expressions allow you to specify the string to match against; it need
not be the entire current input record. The two operators, @samp{~}
and @samp{!~}, perform regular expression comparisons. Expressions
using these operators can be used as patterns or in @code{if},
@code{while}, @code{for}, and @code{do} statements.
@ifinfo
@c adding this xref in TeX screws up the formatting too much
(@xref{Statements, ,Control Statements in Actions}.)
@end ifinfo

@table @code
@item @var{exp} ~ /@var{regexp}/
This is true if the expression @var{exp} (taken as a string)
is matched by @var{regexp}.
The following example matches, or selects,
all input records with the upper-case letter @samp{J} somewhere in the
first field:

@example
@group
$ awk '$1 ~ /J/' inventory-shipped
@print{} Jan  13  25  15 115
@print{} Jun  31  42  75 492
@print{} Jul  24  34  67 436
@print{} Jan  21  36  64 620
@end group
@end example

So does this:

@example
awk '@{ if ($1 ~ /J/) print @}' inventory-shipped
@end example

@item @var{exp} !~ /@var{regexp}/
This is true if the expression @var{exp} (taken as a character string)
is @emph{not} matched by @var{regexp}. The following example matches,
or selects, all input records whose first field @emph{does not} contain
the upper-case letter @samp{J}:

@example
@group
$ awk '$1 !~ /J/' inventory-shipped
@print{} Feb  15  32  24 226
@print{} Mar  15  24  34 228
@print{} Apr  31  52  63 420
@print{} May  16  34  29 208
@dots{}
@end group
@end example
@end table

@cindex regexp constant
When a regexp is written enclosed in slashes, like @code{/foo/}, we call it
a @dfn{regexp constant}, much like @code{5.27} is a numeric constant, and
@code{"foo"} is a string constant.

@node Escape Sequences, Regexp Operators, Regexp Usage, Regexp
@section Escape Sequences

@cindex escape sequence notation
Some characters cannot be included literally in string constants
(@code{"foo"}) or regexp constants (@code{/foo/}). You represent them
instead with @dfn{escape sequences}, which are character sequences
beginning with a backslash (@samp{\}).

One use of an escape sequence is to include a double-quote character in
a string constant. Since a plain double-quote would end the string, you
must use @samp{\"} to represent an actual double-quote character as a
part of the string. For example:

@example
$ awk 'BEGIN @{ print "He said \"hi!\" to her." @}'
@print{} He said "hi!" to her.
@end example

The backslash character itself is another character that cannot be
included normally; you write @samp{\\} to put one backslash in the
string or regexp. Thus, the string whose contents are the two characters
@samp{"} and @samp{\} must be written @code{"\"\\"}.

Another use of backslash is to represent unprintable characters
such as tab or newline. While there is nothing to stop you from entering most
unprintable characters directly in a string constant or regexp constant,
they may look ugly.

Here is a table of all the escape sequences used in @code{awk}, and
what they represent. Unless noted otherwise, all of these escape
sequences apply to both string constants and regexp constants.

@c @cartouche
@table @code
@item \\
A literal backslash, @samp{\}.

@cindex @code{awk} language, V.4 version
@item \a
The ``alert'' character, @kbd{Control-g}, ASCII code 7 (BEL).

@item \b
Backspace, @kbd{Control-h}, ASCII code 8 (BS).

@item \f
Formfeed, @kbd{Control-l}, ASCII code 12 (FF).

@item \n
Newline, @kbd{Control-j}, ASCII code 10 (LF).

@item \r
Carriage return, @kbd{Control-m}, ASCII code 13 (CR).

@item \t
Horizontal tab, @kbd{Control-i}, ASCII code 9 (HT).

@cindex @code{awk} language, V.4 version
@item \v
Vertical tab, @kbd{Control-k}, ASCII code 11 (VT).

@item \@var{nnn}
The octal value @var{nnn}, where @var{nnn} are one to three digits
between @samp{0} and @samp{7}. For example, the code for the ASCII ESC
(escape) character is @samp{\033}.

@cindex @code{awk} language, V.4 version
@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
@item \x@var{hh}@dots{}
The hexadecimal value @var{hh}, where @var{hh} are hexadecimal
digits (@samp{0} through @samp{9} and either @samp{A} through @samp{F} or
@samp{a} through @samp{f}). Like the same construct in ANSI C, the escape
sequence continues until the first non-hexadecimal digit is seen. However,
using more than two hexadecimal digits produces undefined results. (The
@samp{\x} escape sequence is not allowed in POSIX @code{awk}.)

@item \/
A literal slash (necessary for regexp constants only).
You use this when you wish to write a regexp
constant that contains a slash. Since the regexp is delimited by
slashes, you need to escape the slash that is part of the pattern,
in order to tell @code{awk} to keep processing the rest of the regexp.

@item \"
A literal double-quote (necessary for string constants only).
You use this when you wish to write a string
constant that contains a double-quote. Since the string is delimited by
double-quotes, you need to escape the quote that is part of the string,
in order to tell @code{awk} to keep processing the rest of the string.
@end table
@c @end cartouche

In @code{gawk}, there are additional two-character sequences that begin
with backslash that have special meaning in regexps.
@xref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}}.

In a string constant,
what happens if you place a backslash before something that is not one of
the characters listed above? POSIX @code{awk} purposely leaves this case
undefined. There are two choices.

@itemize @bullet
@item
Strip the backslash out. This is what Unix @code{awk} and @code{gawk} both do.
For example, @code{"a\qc"} is the same as @code{"aqc"}.

@item
Leave the backslash alone. Some other @code{awk} implementations do this.
In such implementations, @code{"a\qc"} is the same as if you had typed
@code{"a\\qc"}.
@end itemize

In a regexp, a backslash before any character that is not in the above table,
and not listed in
@ref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}},
means that the next character should be taken literally, even if it would
normally be a regexp operator. E.g., @code{/a\+b/} matches the three
characters @samp{a+b}.

@cindex portability issues
For complete portability, do not use a backslash before any character not
listed in the table above.

Another interesting question arises. Suppose you use an octal or hexadecimal
escape to represent a regexp metacharacter
(@pxref{Regexp Operators, , Regular Expression Operators}).
Does @code{awk} treat the character as a literal character, or as a regexp
operator?

@cindex dark corner
It turns out that historically, such characters were taken literally (d.c.).
However, the POSIX standard indicates that they should be treated
as real metacharacters, and this is what @code{gawk} does.
However, in compatibility mode (@pxref{Options, ,Command Line Options}),
@code{gawk} treats the characters represented by octal and hexadecimal
escape sequences literally when used in regexp constants. Thus,
@code{/a\52b/} is equivalent to @code{/a\*b/}.

To summarize:

@enumerate 1
@item
The escape sequences in the table above are always processed first,
for both string constants and regexp constants. This happens very early,
as soon as @code{awk} reads your program.

@item
@code{gawk} processes both regexp constants and dynamic regexps
(@pxref{Computed Regexps, ,Using Dynamic Regexps}),
for the special operators listed in
@ref{GNU Regexp Operators, ,Additional Regexp Operators Only in @code{gawk}}.

@item
A backslash before any other character means to treat that character
literally.
@end enumerate

@node Regexp Operators, GNU Regexp Operators, Escape Sequences, Regexp
@section Regular Expression Operators
@cindex metacharacters
@cindex regular expression metacharacters
@cindex regexp operators

You can combine regular expressions with the following characters,
called @dfn{regular expression operators}, or @dfn{metacharacters}, to
increase the power and versatility of regular expressions.

The escape sequences described
@iftex
above
@end iftex
in @ref{Escape Sequences},
are valid inside a regexp. They are introduced by a @samp{\}. They
are recognized and converted into the corresponding real characters as
the very first step in processing regexps.

Here is a table of metacharacters. All characters that are not escape
sequences and that are not listed in the table stand for themselves.

@table @code
@item \
This is used to suppress the special meaning of a character when
matching. For example:

@example
\$
@end example

@noindent
matches the character @samp{$}.

@c NEEDED
@page
@cindex anchors in regexps
@cindex regexp, anchors
@item ^
This matches the beginning of a string. For example:

@example
^@@chapter
@end example

@noindent
matches the @samp{@@chapter} at the beginning of a string, and can be used
to identify chapter beginnings in Texinfo source files.
The @samp{^} is known as an @dfn{anchor}, since it anchors the pattern to
matching only at the beginning of the string.

It is important to realize that @samp{^} does not match the beginning of
a line embedded in a string. In this example the condition is not true:

@example
if ("line1\nLINE 2" ~ /^L/) @dots{}
@end example

@item $
This is similar to @samp{^}, but it matches only at the end of a string.
For example:

@example
p$
@end example

@noindent
matches a record that ends with a @samp{p}. The @samp{$} is also an anchor,
and also does not match the end of a line embedded in a string. In this
example the condition is not true:

@example
if ("line1\nLINE 2" ~ /1$/) @dots{}
@end example

@item .
The period, or dot, matches any single character,
@emph{including} the newline character. For example:

@example
.P
@end example

@noindent
matches any single character followed by a @samp{P} in a string. Using
concatenation we can make a regular expression like @samp{U.A}, which
matches any three-character sequence that begins with @samp{U} and ends
with @samp{A}.

@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
In strict POSIX mode (@pxref{Options, ,Command Line Options}),
@samp{.} does not match the @sc{nul}
character, which is a character with all bits equal to zero.
Otherwise, @sc{nul} is just another character. Other versions of @code{awk}
may not be able to match the @sc{nul} character.

@ignore
2e: Add stuff that character list is the POSIX terminology. In other
    literature known as character set or character class.
@end ignore

@cindex character list
@item [@dots{}]
This is called a @dfn{character list}. It matches any @emph{one} of the
characters that are enclosed in the square brackets.
For example: 2302 2303@example 2304[MVX] 2305@end example 2306 2307@noindent 2308matches any one of the characters @samp{M}, @samp{V}, or @samp{X} in a 2309string. 2310 2311Ranges of characters are indicated by using a hyphen between the beginning 2312and ending characters, and enclosing the whole thing in brackets. For 2313example: 2314 2315@example 2316[0-9] 2317@end example 2318 2319@noindent 2320matches any digit. 2321Multiple ranges are allowed. E.g., the list @code{@w{[A-Za-z0-9]}} is a 2322common way to express the idea of ``all alphanumeric characters.'' 2323 2324To include one of the characters @samp{\}, @samp{]}, @samp{-} or @samp{^} in a 2325character list, put a @samp{\} in front of it. For example: 2326 2327@example 2328[d\]] 2329@end example 2330 2331@noindent 2332matches either @samp{d}, or @samp{]}. 2333 2334@cindex @code{egrep} 2335This treatment of @samp{\} in character lists 2336is compatible with other @code{awk} 2337implementations, and is also mandated by POSIX. 2338The regular expressions in @code{awk} are a superset 2339of the POSIX specification for Extended Regular Expressions (EREs). 2340POSIX EREs are based on the regular expressions accepted by the 2341traditional @code{egrep} utility. 2342 2343@cindex character classes 2344@cindex @code{awk} language, POSIX version 2345@cindex POSIX @code{awk} 2346@dfn{Character classes} are a new feature introduced in the POSIX standard. 2347A character class is a special notation for describing 2348lists of characters that have a specific attribute, but where the 2349actual characters themselves can vary from country to country and/or 2350from character set to character set. For example, the notion of what 2351is an alphabetic character differs in the USA and in France. 2352 2353A character class is only valid in a regexp @emph{inside} the 2354brackets of a character list. Character classes consist of @samp{[:}, 2355a keyword denoting the class, and @samp{:]}. 
Here are the character
classes defined by the POSIX standard.

@table @code
@item [:alnum:]
Alphanumeric characters.

@item [:alpha:]
Alphabetic characters.

@item [:blank:]
Space and tab characters.

@item [:cntrl:]
Control characters.

@item [:digit:]
Numeric characters.

@item [:graph:]
Characters that are printable and are also visible.
(A space is printable, but not visible, while an @samp{a} is both.)

@item [:lower:]
Lower-case alphabetic characters.

@item [:print:]
Printable characters (characters that are not control characters).

@item [:punct:]
Punctuation characters (characters that are not letters, digits,
control characters, or space characters).

@item [:space:]
Space characters (such as space, tab, and formfeed, to name a few).

@item [:upper:]
Upper-case alphabetic characters.

@item [:xdigit:]
Characters that are hexadecimal digits.
@end table

For example, before the POSIX standard, to match alphanumeric
characters, you had to write @code{/[A-Za-z0-9]/}. If your
character set had other alphabetic characters in it, this would not
match them. With the POSIX character classes, you can write
@code{/[[:alnum:]]/}, and this will match @emph{all} the alphabetic
and numeric characters in your character set.

@cindex collating elements
Two additional special sequences can appear in character lists.
These apply to non-ASCII character sets, which can have single symbols
(called @dfn{collating elements}) that are represented with more than one
character, as well as several characters that are equivalent for
@dfn{collating}, or sorting, purposes. (E.g., in French, a plain ``e''
and a grave-accented ``@`e'' are equivalent.)

@table @asis
@cindex collating symbols
@item Collating Symbols
A @dfn{collating symbol} is a multi-character collating element enclosed in
@samp{[.} and @samp{.]}. For example, if @samp{ch} is a collating element,
then @code{[[.ch.]]} is a regexp that matches this collating element, while
@code{[ch]} is a regexp that matches either @samp{c} or @samp{h}.

@cindex equivalence classes
@item Equivalence Classes
An @dfn{equivalence class} is a locale-specific name for a list of
characters that are equivalent. The name is enclosed in
@samp{[=} and @samp{=]}.
For example, the name @samp{e} might be used to represent all of
``e,'' ``@`e,'' and ``@'e.'' In this case, @code{[[=e=]]} is a regexp
that matches any of @samp{e}, @samp{@'e}, or @samp{@`e}.
@end table

These features are very valuable in non-English-speaking locales.

@strong{Caution:} The library functions that @code{gawk} uses for regular
expression matching currently only recognize POSIX character classes;
they do not recognize collating symbols or equivalence classes.
@c maybe one day ...

@cindex complemented character list
@cindex character list, complemented
@item [^ @dots{}]
This is a @dfn{complemented character list}. The first character after
the @samp{[} @emph{must} be a @samp{^}. It matches any character
@emph{except} those in the square brackets. For example:

@example
[^0-9]
@end example

@noindent
matches any character that is not a digit.

@item |
This is the @dfn{alternation operator}, and it is used to specify
alternatives. For example:

@example
^P|[0-9]
@end example

@noindent
matches any string that matches either @samp{^P} or @samp{[0-9]}. This
means it matches any string that starts with @samp{P} or contains a digit.
2463 2464The alternation applies to the largest possible regexps on either side. 2465In other words, @samp{|} has the lowest precedence of all the regular 2466expression operators. 2467 2468@item (@dots{}) 2469Parentheses are used for grouping in regular expressions as in 2470arithmetic. They can be used to concatenate regular expressions 2471containing the alternation operator, @samp{|}. For example, 2472@samp{@@(samp|code)\@{[^@}]+\@}} matches both @samp{@@code@{foo@}} and 2473@samp{@@samp@{bar@}}. (These are Texinfo formatting control sequences.) 2474 2475@item * 2476This symbol means that the preceding regular expression is to be 2477repeated as many times as necessary to find a match. For example: 2478 2479@example 2480ph* 2481@end example 2482 2483@noindent 2484applies the @samp{*} symbol to the preceding @samp{h} and looks for matches 2485of one @samp{p} followed by any number of @samp{h}s. This will also match 2486just @samp{p} if no @samp{h}s are present. 2487 2488The @samp{*} repeats the @emph{smallest} possible preceding expression. 2489(Use parentheses if you wish to repeat a larger expression.) It finds 2490as many repetitions as possible. For example: 2491 2492@example 2493awk '/\(c[ad][ad]*r x\)/ @{ print @}' sample 2494@end example 2495 2496@noindent 2497prints every record in @file{sample} containing a string of the form 2498@samp{(car x)}, @samp{(cdr x)}, @samp{(cadr x)}, and so on. 2499Notice the escaping of the parentheses by preceding them 2500with backslashes. 2501 2502@item + 2503This symbol is similar to @samp{*}, but the preceding expression must be 2504matched at least once. This means that: 2505 2506@example 2507wh+y 2508@end example 2509 2510@noindent 2511would match @samp{why} and @samp{whhy} but not @samp{wy}, whereas 2512@samp{wh*y} would match all three of these strings. This is a simpler 2513way of writing the last @samp{*} example: 2514 2515@example 2516awk '/\(c[ad]+r x\)/ @{ print @}' sample 2517@end example 2518 2519@item ? 
This symbol is similar to @samp{*}, but the preceding expression can be
matched either once or not at all. For example:

@example
fe?d
@end example

@noindent
will match @samp{fed} and @samp{fd}, but nothing else.

@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
@cindex interval expressions
@item @{@var{n}@}
@itemx @{@var{n},@}
@itemx @{@var{n},@var{m}@}
One or two numbers inside braces denote an @dfn{interval expression}.
If there is one number in the braces, the preceding regexp is repeated
@var{n} times.
If there are two numbers separated by a comma, the preceding regexp is
repeated @var{n} to @var{m} times.
If there is one number followed by a comma, then the preceding regexp
is repeated at least @var{n} times.

@table @code
@item wh@{3@}y
matches @samp{whhhy} but not @samp{why} or @samp{whhhhy}.

@item wh@{3,5@}y
matches @samp{whhhy} or @samp{whhhhy} or @samp{whhhhhy}, only.

@item wh@{2,@}y
matches @samp{whhy} or @samp{whhhy}, and so on.
@end table

Interval expressions were not traditionally available in @code{awk}.
They were added as part of the POSIX standard, to make @code{awk}
and @code{egrep} consistent with each other.

However, since old programs may use @samp{@{} and @samp{@}} in regexp
constants, by default @code{gawk} does @emph{not} match interval expressions
in regexps. If either @samp{--posix} or @samp{--re-interval} is specified
(@pxref{Options, , Command Line Options}), then interval expressions
are allowed in regexps.
@end table

@cindex precedence, regexp operators
@cindex regexp operators, precedence of
In regular expressions, the @samp{*}, @samp{+}, and @samp{?} operators,
as well as the braces @samp{@{} and @samp{@}},
have
the highest precedence, followed by concatenation, and finally by @samp{|}.
2572As in arithmetic, parentheses can change how operators are grouped. 2573 2574If @code{gawk} is in compatibility mode 2575(@pxref{Options, ,Command Line Options}), 2576character classes and interval expressions are not available in 2577regular expressions. 2578 2579The next 2580@ifinfo 2581node 2582@end ifinfo 2583@iftex 2584section 2585@end iftex 2586discusses the GNU-specific regexp operators, and provides 2587more detail concerning how command line options affect the way @code{gawk} 2588interprets the characters in regular expressions. 2589 2590@node GNU Regexp Operators, Case-sensitivity, Regexp Operators, Regexp 2591@section Additional Regexp Operators Only in @code{gawk} 2592 2593@c This section adapted from the regex-0.12 manual 2594 2595@cindex regexp operators, GNU specific 2596GNU software that deals with regular expressions provides a number of 2597additional regexp operators. These operators are described in this 2598section, and are specific to @code{gawk}; they are not available in other 2599@code{awk} implementations. 2600 2601@cindex word, regexp definition of 2602Most of the additional operators are for dealing with word matching. 2603For our purposes, a @dfn{word} is a sequence of one or more letters, digits, 2604or underscores (@samp{_}). 2605 2606@table @code 2607@cindex @code{\w} regexp operator 2608@item \w 2609This operator matches any word-constituent character, i.e.@: any 2610letter, digit, or underscore. Think of it as a short-hand for 2611@c @w{@code{[A-Za-z0-9_]}} or 2612@w{@code{[[:alnum:]_]}}. 2613 2614@cindex @code{\W} regexp operator 2615@item \W 2616This operator matches any character that is not word-constituent. 2617Think of it as a short-hand for 2618@c @w{@code{[^A-Za-z0-9_]}} or 2619@w{@code{[^[:alnum:]_]}}. 2620 2621@cindex @code{\<} regexp operator 2622@item \< 2623This operator matches the empty string at the beginning of a word. 2624For example, @code{/\<away/} matches @samp{away}, but not 2625@samp{stowaway}. 
2626 2627@cindex @code{\>} regexp operator 2628@item \> 2629This operator matches the empty string at the end of a word. 2630For example, @code{/stow\>/} matches @samp{stow}, but not @samp{stowaway}. 2631 2632@cindex @code{\y} regexp operator 2633@cindex word boundaries, matching 2634@item \y 2635This operator matches the empty string at either the beginning or the 2636end of a word (the word boundar@strong{y}). For example, @samp{\yballs?\y} 2637matches either @samp{ball} or @samp{balls} as a separate word. 2638 2639@cindex @code{\B} regexp operator 2640@item \B 2641This operator matches the empty string within a word. In other words, 2642@samp{\B} matches the empty string that occurs between two 2643word-constituent characters. For example, 2644@code{/\Brat\B/} matches @samp{crate}, but it does not match @samp{dirty rat}. 2645@samp{\B} is essentially the opposite of @samp{\y}. 2646@end table 2647 2648There are two other operators that work on buffers. In Emacs, a 2649@dfn{buffer} is, naturally, an Emacs buffer. For other programs, the 2650regexp library routines that @code{gawk} uses consider the entire 2651string to be matched as the buffer. 2652 2653For @code{awk}, since @samp{^} and @samp{$} always work in terms 2654of the beginning and end of strings, these operators don't add any 2655new capabilities. They are provided for compatibility with other GNU 2656software. 2657 2658@cindex buffer matching operators 2659@table @code 2660@cindex @code{\`} regexp operator 2661@item \` 2662This operator matches the empty string at the 2663beginning of the buffer. 2664 2665@cindex @code{\'} regexp operator 2666@item \' 2667This operator matches the empty string at the 2668end of the buffer. 2669@end table 2670 2671In other GNU software, the word boundary operator is @samp{\b}. However, 2672that conflicts with the @code{awk} language's definition of @samp{\b} 2673as backspace, so @code{gawk} uses a different letter. 

An alternative method would have been to require two backslashes in the
GNU operators, but this was deemed to be too confusing, and the current
method of using @samp{\y} for the GNU @samp{\b} appears to be the
lesser of two evils.

@c NOTE!!! Keep this in sync with the same table in the summary appendix!
@cindex regexp, effect of command line options
The various command line options
(@pxref{Options, ,Command Line Options})
control how @code{gawk} interprets characters in regexps.

@table @asis
@item No options
In the default case, @code{gawk} provides all the facilities of
POSIX regexps and the GNU regexp operators described
@iftex
above.
@end iftex
@ifinfo
in @ref{Regexp Operators, ,Regular Expression Operators}.
@end ifinfo
However, interval expressions are not supported.

@item @code{--posix}
Only POSIX regexps are supported; the GNU operators are not special
(e.g., @samp{\w} matches a literal @samp{w}). Interval expressions
are allowed.

@item @code{--traditional}
Traditional Unix @code{awk} regexps are matched. The GNU operators
are not special, interval expressions are not available, and neither
are the POSIX character classes (@code{[[:alnum:]]} and so on).
Characters described by octal and hexadecimal escape sequences are
treated literally, even if they represent regexp metacharacters.

@item @code{--re-interval}
Allow interval expressions in regexps, even if @samp{--traditional}
has been provided.
@end table

@node Case-sensitivity, Leftmost Longest, GNU Regexp Operators, Regexp
@section Case-sensitivity in Matching

@cindex case sensitivity
@cindex ignoring case
Case is normally significant in regular expressions, both when matching
ordinary characters (i.e.@: not metacharacters), and inside character
lists.
Thus a @samp{w} in a regular expression matches only a lower-case 2723@samp{w} and not an upper-case @samp{W}. 2724 2725The simplest way to do a case-independent match is to use a character 2726list: @samp{[Ww]}. However, this can be cumbersome if you need to use it 2727often; and it can make the regular expressions harder to 2728read. There are two alternatives that you might prefer. 2729 2730One way to do a case-insensitive match at a particular point in the 2731program is to convert the data to a single case, using the 2732@code{tolower} or @code{toupper} built-in string functions (which we 2733haven't discussed yet; 2734@pxref{String Functions, ,Built-in Functions for String Manipulation}). 2735For example: 2736 2737@example 2738tolower($1) ~ /foo/ @{ @dots{} @} 2739@end example 2740 2741@noindent 2742converts the first field to lower-case before matching against it. 2743This will work in any POSIX-compliant implementation of @code{awk}. 2744 2745@cindex differences between @code{gawk} and @code{awk} 2746@cindex @code{~} operator 2747@cindex @code{!~} operator 2748@vindex IGNORECASE 2749Another method, specific to @code{gawk}, is to set the variable 2750@code{IGNORECASE} to a non-zero value (@pxref{Built-in Variables}). 2751When @code{IGNORECASE} is not zero, @emph{all} regexp and string 2752operations ignore case. Changing the value of 2753@code{IGNORECASE} dynamically controls the case sensitivity of your 2754program as it runs. Case is significant by default because 2755@code{IGNORECASE} (like most variables) is initialized to zero. 

@example
@group
x = "aB"
if (x ~ /ab/) @dots{}   # this test will fail
@end group

@group
IGNORECASE = 1
if (x ~ /ab/) @dots{}   # now it will succeed
@end group
@end example

In general, you cannot use @code{IGNORECASE} to make certain rules
case-insensitive and other rules case-sensitive, because there is no way
to set @code{IGNORECASE} just for the pattern of a particular rule.
@ignore
This isn't quite true. Consider:

	IGNORECASE=1 && /foObAr/ { .... }
	IGNORECASE=0 || /foobar/ { .... }

But that's pretty bad style and I don't want to get into it at this
late date.
@end ignore
To do this, you must use character lists or @code{tolower}. However, one
thing you can do only with @code{IGNORECASE} is turn case-sensitivity on
or off dynamically for all the rules at once.

@code{IGNORECASE} can be set on the command line, or in a @code{BEGIN} rule
(@pxref{Other Arguments, ,Other Command Line Arguments}; also
@pxref{Using BEGIN/END, ,Startup and Cleanup Actions}).
Setting @code{IGNORECASE} from the command line is a way to make
a program case-insensitive without having to edit it.

Prior to version 3.0 of @code{gawk}, the value of @code{IGNORECASE}
only affected regexp operations. It did not affect string comparison
with @samp{==}, @samp{!=}, and so on.
Beginning with version 3.0, both regexp and string comparison
operations are affected by @code{IGNORECASE}.

@cindex ISO 8859-1
@cindex ISO Latin-1
Beginning with version 3.0 of @code{gawk}, the equivalences between upper-case
and lower-case characters are based on the ISO-8859-1 (ISO Latin-1)
character set. This character set is a superset of the traditional 128
ASCII characters, which also provides a number of characters suitable
for use with European languages.
2804@ignore 2805A pure ASCII character set can be used instead if @code{gawk} is compiled 2806with @samp{-DUSE_PURE_ASCII}. 2807@end ignore 2808 2809The value of @code{IGNORECASE} has no effect if @code{gawk} is in 2810compatibility mode (@pxref{Options, ,Command Line Options}). 2811Case is always significant in compatibility mode. 2812 2813@node Leftmost Longest, Computed Regexps, Case-sensitivity, Regexp 2814@section How Much Text Matches? 2815 2816@cindex leftmost longest match 2817@cindex matching, leftmost longest 2818Consider the following example: 2819 2820@example 2821echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}' 2822@end example 2823 2824This example uses the @code{sub} function (which we haven't discussed yet, 2825@pxref{String Functions, ,Built-in Functions for String Manipulation}) 2826to make a change to the input record. Here, the regexp @code{/a+/} 2827indicates ``one or more @samp{a} characters,'' and the replacement 2828text is @samp{<A>}. 2829 2830The input contains four @samp{a} characters. What will the output be? 2831In other words, how many is ``one or more''---will @code{awk} match two, 2832three, or all four @samp{a} characters? 2833 2834The answer is, @code{awk} (and POSIX) regular expressions always match 2835the leftmost, @emph{longest} sequence of input characters that can 2836match. Thus, in this example, all four @samp{a} characters are 2837replaced with @samp{<A>}. 2838 2839@example 2840$ echo aaaabcd | awk '@{ sub(/a+/, "<A>"); print @}' 2841@print{} <A>bcd 2842@end example 2843 2844For simple match/no-match tests, this is not so important. But when doing 2845text matching and substitutions with the @code{match}, @code{sub}, @code{gsub}, 2846and @code{gensub} functions, it is very important. 2847@ifinfo 2848@xref{String Functions, ,Built-in Functions for String Manipulation}, 2849for more information on these functions. 
2850@end ifinfo 2851Understanding this principle is also important for regexp-based record 2852and field splitting (@pxref{Records, ,How Input is Split into Records}, 2853and also @pxref{Field Separators, ,Specifying How Fields are Separated}). 2854 2855@node Computed Regexps, , Leftmost Longest, Regexp 2856@section Using Dynamic Regexps 2857 2858@cindex computed regular expressions 2859@cindex regular expressions, computed 2860@cindex dynamic regular expressions 2861@cindex regexp, dynamic 2862@cindex @code{~} operator 2863@cindex @code{!~} operator 2864The right hand side of a @samp{~} or @samp{!~} operator need not be a 2865regexp constant (i.e.@: a string of characters between slashes). It may 2866be any expression. The expression is evaluated, and converted if 2867necessary to a string; the contents of the string are used as the 2868regexp. A regexp that is computed in this way is called a @dfn{dynamic 2869regexp}. For example: 2870 2871@example 2872BEGIN @{ identifier_regexp = "[A-Za-z_][A-Za-z_0-9]*" @} 2873$0 ~ identifier_regexp @{ print @} 2874@end example 2875 2876@noindent 2877sets @code{identifier_regexp} to a regexp that describes @code{awk} 2878variable names, and tests if the input record matches this regexp. 2879 2880@ignore 2881Do we want to use "^[A-Za-z_][A-Za-z_0-9]*$" to restrict the entire 2882record to just identifiers? Doing that also would disrupt the flow of 2883the text. 2884@end ignore 2885 2886@strong{Caution:} When using the @samp{~} and @samp{!~} 2887operators, there is a difference between a regexp constant 2888enclosed in slashes, and a string constant enclosed in double quotes. 2889If you are going to use a string constant, you have to understand that 2890the string is in essence scanned @emph{twice}; the first time when 2891@code{awk} reads your program, and the second time when it goes to 2892match the string on the left-hand side of the operator with the pattern 2893on the right. 
This is true of any string valued expression (such as 2894@code{identifier_regexp} above), not just string constants. 2895 2896@cindex regexp constants, difference between slashes and quotes 2897What difference does it make if the string is 2898scanned twice? The answer has to do with escape sequences, and particularly 2899with backslashes. To get a backslash into a regular expression inside a 2900string, you have to type two backslashes. 2901 2902For example, @code{/\*/} is a regexp constant for a literal @samp{*}. 2903Only one backslash is needed. To do the same thing with a string, 2904you would have to type @code{"\\*"}. The first backslash escapes the 2905second one, so that the string actually contains the 2906two characters @samp{\} and @samp{*}. 2907 2908@cindex common mistakes 2909@cindex mistakes, common 2910@cindex errors, common 2911Given that you can use both regexp and string constants to describe 2912regular expressions, which should you use? The answer is ``regexp 2913constants,'' for several reasons. 2914 2915@enumerate 1 2916@item 2917String constants are more complicated to write, and 2918more difficult to read. Using regexp constants makes your programs 2919less error-prone. Not understanding the difference between the two 2920kinds of constants is a common source of errors. 2921 2922@item 2923It is also more efficient to use regexp constants: @code{awk} can note 2924that you have supplied a regexp and store it internally in a form that 2925makes pattern matching more efficient. When using a string constant, 2926@code{awk} must first convert the string into this internal form, and 2927then perform the pattern matching. 2928 2929@item 2930Using regexp constants is better style; it shows clearly that you 2931intend a regexp match. 
2932@end enumerate 2933 2934@node Reading Files, Printing, Regexp, Top 2935@chapter Reading Input Files 2936 2937@cindex reading files 2938@cindex input 2939@cindex standard input 2940@vindex FILENAME 2941In the typical @code{awk} program, all input is read either from the 2942standard input (by default the keyboard, but often a pipe from another 2943command) or from files whose names you specify on the @code{awk} command 2944line. If you specify input files, @code{awk} reads them in order, reading 2945all the data from one before going on to the next. The name of the current 2946input file can be found in the built-in variable @code{FILENAME} 2947(@pxref{Built-in Variables}). 2948 2949The input is read in units called @dfn{records}, and processed by the 2950rules of your program one record at a time. 2951By default, each record is one line. Each 2952record is automatically split into chunks called @dfn{fields}. 2953This makes it more convenient for programs to work on the parts of a record. 2954 2955On rare occasions you will need to use the @code{getline} command. 2956The @code{getline} command is valuable, both because it 2957can do explicit input from any number of files, and because the files 2958used with it do not have to be named on the @code{awk} command line 2959(@pxref{Getline, ,Explicit Input with @code{getline}}). 2960 2961@menu 2962* Records:: Controlling how data is split into records. 2963* Fields:: An introduction to fields. 2964* Non-Constant Fields:: Non-constant Field Numbers. 2965* Changing Fields:: Changing the Contents of a Field. 2966* Field Separators:: The field separator and how to change it. 2967* Constant Size:: Reading constant width data. 2968* Multiple Line:: Reading multi-line records. 2969* Getline:: Reading files under explicit program control 2970 using the @code{getline} function. 
2971@end menu 2972 2973@node Records, Fields, Reading Files, Reading Files 2974@section How Input is Split into Records 2975 2976@cindex record separator, @code{RS} 2977@cindex changing the record separator 2978@cindex record, definition of 2979@vindex RS 2980The @code{awk} utility divides the input for your @code{awk} 2981program into records and fields. 2982Records are separated by a character called the @dfn{record separator}. 2983By default, the record separator is the newline character. 2984This is why records are, by default, single lines. 2985You can use a different character for the record separator by 2986assigning the character to the built-in variable @code{RS}. 2987 2988You can change the value of @code{RS} in the @code{awk} program, 2989like any other variable, with the 2990assignment operator, @samp{=} (@pxref{Assignment Ops, ,Assignment Expressions}). 2991The new record-separator character should be enclosed in quotation marks, 2992which indicate 2993a string constant. Often the right time to do this is at the beginning 2994of execution, before any input has been processed, so that the very 2995first record will be read with the proper separator. To do this, use 2996the special @code{BEGIN} pattern 2997(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}). For 2998example: 2999 3000@example 3001awk 'BEGIN @{ RS = "/" @} ; @{ print $0 @}' BBS-list 3002@end example 3003 3004@noindent 3005changes the value of @code{RS} to @code{"/"}, before reading any input. 3006This is a string whose first character is a slash; as a result, records 3007are separated by slashes. Then the input file is read, and the second 3008rule in the @code{awk} program (the action with no pattern) prints each 3009record. Since each @code{print} statement adds a newline at the end of 3010its output, the effect of this @code{awk} program is to copy the input 3011with each slash changed to a newline. 
Here are the results of running 3012the program on @file{BBS-list}: 3013 3014@example 3015@group 3016$ awk 'BEGIN @{ RS = "/" @} ; @{ print $0 @}' BBS-list 3017@print{} aardvark 555-5553 1200 3018@print{} 300 B 3019@print{} alpo-net 555-3412 2400 3020@print{} 1200 3021@print{} 300 A 3022@print{} barfly 555-7685 1200 3023@print{} 300 A 3024@print{} bites 555-1675 2400 3025@print{} 1200 3026@print{} 300 A 3027@print{} camelot 555-0542 300 C 3028@print{} core 555-2912 1200 3029@print{} 300 C 3030@print{} fooey 555-1234 2400 3031@print{} 1200 3032@print{} 300 B 3033@print{} foot 555-6699 1200 3034@print{} 300 B 3035@print{} macfoo 555-6480 1200 3036@print{} 300 A 3037@print{} sdace 555-3430 2400 3038@print{} 1200 3039@print{} 300 A 3040@print{} sabafoo 555-2127 1200 3041@print{} 300 C 3042@print{} 3043@end group 3044@end example 3045 3046@noindent 3047Note that the entry for the @samp{camelot} BBS is not split. 3048In the original data file 3049(@pxref{Sample Data Files, , Data Files for the Examples}), 3050the line looks like this: 3051 3052@example 3053camelot 555-0542 300 C 3054@end example 3055 3056@noindent 3057It only has one baud rate; there are no slashes in the record. 3058 3059Another way to change the record separator is on the command line, 3060using the variable-assignment feature 3061(@pxref{Other Arguments, ,Other Command Line Arguments}). 3062 3063@example 3064awk '@{ print $0 @}' RS="/" BBS-list 3065@end example 3066 3067@noindent 3068This sets @code{RS} to @samp{/} before processing @file{BBS-list}. 3069 3070Using an unusual character such as @samp{/} for the record separator 3071produces correct behavior in the vast majority of cases. However, 3072the following (extreme) pipeline prints a surprising @samp{1}. There 3073is one field, consisting of a newline. The value of the built-in 3074variable @code{NF} is the number of fields in the current record. 
3075 3076@example 3077@group 3078$ echo | awk 'BEGIN @{ RS = "a" @} ; @{ print NF @}' 3079@print{} 1 3080@end group 3081@end example 3082 3083@cindex dark corner 3084@noindent 3085Reaching the end of an input file terminates the current input record, 3086even if the last character in the file is not the character in @code{RS} 3087(d.c.). 3088 3089@cindex empty string 3090The empty string, @code{""} (a string of no characters), has a special meaning 3091as the value of @code{RS}: it means that records are separated 3092by one or more blank lines, and nothing else. 3093@xref{Multiple Line, ,Multiple-Line Records}, for more details. 3094 3095If you change the value of @code{RS} in the middle of an @code{awk} run, 3096the new value is used to delimit subsequent records, but the record 3097currently being processed (and records already processed) are not 3098affected. 3099 3100@vindex RT 3101@cindex record terminator, @code{RT} 3102@cindex terminator, record 3103@cindex differences between @code{gawk} and @code{awk} 3104After the end of the record has been determined, @code{gawk} 3105sets the variable @code{RT} to the text in the input that matched 3106@code{RS}. 3107 3108@cindex regular expressions as record separators 3109The value of @code{RS} is in fact not limited to a one-character 3110string. It can be any regular expression 3111(@pxref{Regexp, ,Regular Expressions}). 3112In general, each record 3113ends at the next string that matches the regular expression; the next 3114record starts at the end of the matching string. This general rule is 3115actually at work in the usual case, where @code{RS} contains just a 3116newline: a record ends at the beginning of the next matching string (the 3117next newline in the input) and the following record starts just after 3118the end of this string (at the first character of the following line). 3119The newline, since it matches @code{RS}, is not part of either record. 
3120 3121When @code{RS} is a single character, @code{RT} will 3122contain the same single character. However, when @code{RS} is a 3123regular expression, then @code{RT} becomes more useful; it contains 3124the actual input text that matched the regular expression. 3125 3126The following example illustrates both of these features. 3127It sets @code{RS} equal to a regular expression that 3128matches either a newline, or a series of one or more upper-case letters 3129with optional leading and/or trailing white space 3130(@pxref{Regexp, , Regular Expressions}). 3131 3132@example 3133$ echo record 1 AAAA record 2 BBBB record 3 | 3134> gawk 'BEGIN @{ RS = "\n|( *[[:upper:]]+ *)" @} 3135> @{ print "Record =", $0, "and RT =", RT @}' 3136@print{} Record = record 1 and RT = AAAA 3137@print{} Record = record 2 and RT = BBBB 3138@print{} Record = record 3 and RT = 3139@print{} 3140@end example 3141 3142@noindent 3143The final line of output has an extra blank line. This is because the 3144value of @code{RT} is a newline, and then the @code{print} statement 3145supplies its own terminating newline. 3146 3147@xref{Simple Sed, ,A Simple Stream Editor}, for a more useful example 3148of @code{RS} as a regexp and @code{RT}. 3149 3150@cindex differences between @code{gawk} and @code{awk} 3151The use of @code{RS} as a regular expression and the @code{RT} 3152variable are @code{gawk} extensions; they are not available in 3153compatibility mode 3154(@pxref{Options, ,Command Line Options}). 3155In compatibility mode, only the first character of the value of 3156@code{RS} is used to determine the end of the record. 3157 3158@cindex number of records, @code{NR}, @code{FNR} 3159@vindex NR 3160@vindex FNR 3161The @code{awk} utility keeps track of the number of records that have 3162been read so far from the current input file. This value is stored in a 3163built-in variable called @code{FNR}. It is reset to zero when a new 3164file is started. 
Another built-in variable, @code{NR}, is the total 3165number of input records read so far from all data files. It starts at zero 3166but is never automatically reset to zero. 3167 3168@node Fields, Non-Constant Fields, Records, Reading Files 3169@section Examining Fields 3170 3171@cindex examining fields 3172@cindex fields 3173@cindex accessing fields 3174When @code{awk} reads an input record, the record is 3175automatically separated or @dfn{parsed} by the interpreter into chunks 3176called @dfn{fields}. By default, fields are separated by whitespace, 3177like words in a line. 3178Whitespace in @code{awk} means any string of one or more spaces, 3179tabs or newlines;@footnote{In POSIX @code{awk}, newlines are not 3180considered whitespace for separating fields.} other characters such as 3181formfeed, and so on, that are 3182considered whitespace by other languages are @emph{not} considered 3183whitespace by @code{awk}. 3184 3185The purpose of fields is to make it more convenient for you to refer to 3186these pieces of the record. You don't have to use them---you can 3187operate on the whole record if you wish---but fields are what make 3188simple @code{awk} programs so powerful. 3189 3190@cindex @code{$} (field operator) 3191@cindex field operator @code{$} 3192To refer to a field in an @code{awk} program, you use a dollar-sign, 3193@samp{$}, followed by the number of the field you want. Thus, @code{$1} 3194refers to the first field, @code{$2} to the second, and so on. For 3195example, suppose the following is a line of input: 3196 3197@example 3198This seems like a pretty nice example. 3199@end example 3200 3201@noindent 3202Here the first field, or @code{$1}, is @samp{This}; the second field, or 3203@code{$2}, is @samp{seems}; and so on. Note that the last field, 3204@code{$7}, is @samp{example.}. Because there is no space between the 3205@samp{e} and the @samp{.}, the period is considered part of the seventh 3206field. 
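
As a quick check of the description above, this small sketch feeds the
sample sentence to @code{awk} and prints three of its fields.  The shell
variables @code{line} and @code{out} are only illustrative names.

```shell
# Sketch: print the first, second, and seventh fields of the sample line.
line='This seems like a pretty nice example.'
out=$(printf '%s\n' "$line" | awk '{ print $1; print $2; print $7 }')
echo "$out"
```

The third line of output includes the trailing period, because no whitespace
separates it from the word before it.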
3207 3208@vindex NF 3209@cindex number of fields, @code{NF} 3210@code{NF} is a built-in variable whose value 3211is the number of fields in the current record. 3212@code{awk} updates the value of @code{NF} automatically, each time 3213a record is read. 3214 3215No matter how many fields there are, the last field in a record can be 3216represented by @code{$NF}. So, in the example above, @code{$NF} would 3217be the same as @code{$7}, which is @samp{example.}. Why this works is 3218explained below (@pxref{Non-Constant Fields, ,Non-constant Field Numbers}). 3219If you try to reference a field beyond the last one, such as @code{$8} 3220when the record has only seven fields, you get the empty string. 3221@c the empty string acts like 0 in some contexts, but I don't want to 3222@c get into that here.... 3223 3224@code{$0}, which looks like a reference to the ``zeroth'' field, is 3225a special case: it represents the whole input record. @code{$0} is 3226used when you are not interested in fields. 3227 3228@c NEEDED 3229@page 3230Here are some more examples: 3231 3232@example 3233@group 3234$ awk '$1 ~ /foo/ @{ print $0 @}' BBS-list 3235@print{} fooey 555-1234 2400/1200/300 B 3236@print{} foot 555-6699 1200/300 B 3237@print{} macfoo 555-6480 1200/300 A 3238@print{} sabafoo 555-2127 1200/300 C 3239@end group 3240@end example 3241 3242@noindent 3243This example prints each record in the file @file{BBS-list} whose first 3244field contains the string @samp{foo}. The operator @samp{~} is called a 3245@dfn{matching operator} 3246(@pxref{Regexp Usage, , How to Use Regular Expressions}); 3247it tests whether a string (here, the field @code{$1}) matches a given regular 3248expression. 3249 3250By contrast, the following example 3251looks for @samp{foo} in @emph{the entire record} and prints the first 3252field and the last field for each input record containing a 3253match. 
3254 3255@example 3256@group 3257$ awk '/foo/ @{ print $1, $NF @}' BBS-list 3258@print{} fooey B 3259@print{} foot B 3260@print{} macfoo A 3261@print{} sabafoo C 3262@end group 3263@end example 3264 3265@node Non-Constant Fields, Changing Fields, Fields, Reading Files 3266@section Non-constant Field Numbers 3267 3268The number of a field does not need to be a constant. Any expression in 3269the @code{awk} language can be used after a @samp{$} to refer to a 3270field. The value of the expression specifies the field number. If the 3271value is a string, rather than a number, it is converted to a number. 3272Consider this example: 3273 3274@example 3275awk '@{ print $NR @}' 3276@end example 3277 3278@noindent 3279Recall that @code{NR} is the number of records read so far: one in the 3280first record, two in the second, etc. So this example prints the first 3281field of the first record, the second field of the second record, and so 3282on. For the twentieth record, field number 20 is printed; most likely, 3283the record has fewer than 20 fields, so this prints a blank line. 3284 3285Here is another example of using expressions as field numbers: 3286 3287@example 3288awk '@{ print $(2*2) @}' BBS-list 3289@end example 3290 3291@code{awk} must evaluate the expression @samp{(2*2)} and use 3292its value as the number of the field to print. The @samp{*} sign 3293represents multiplication, so the expression @samp{2*2} evaluates to four. 3294The parentheses are used so that the multiplication is done before the 3295@samp{$} operation; they are necessary whenever there is a binary 3296operator in the field-number expression. This example, then, prints the 3297hours of operation (the fourth field) for every line of the file 3298@file{BBS-list}. (All of the @code{awk} operators are listed, in 3299order of decreasing precedence, in 3300@ref{Precedence, , Operator Precedence (How Operators Nest)}.) 3301 3302If the field number you compute is zero, you get the entire record. 
3303Thus, @code{$(2-2)} has the same value as @code{$0}. Negative field 3304numbers are not allowed; trying to reference one will usually terminate 3305your running @code{awk} program. (The POSIX standard does not define 3306what happens when you reference a negative field number. @code{gawk} 3307will notice this and terminate your program. Other @code{awk} 3308implementations may behave differently.) 3309 3310As mentioned in @ref{Fields, ,Examining Fields}, 3311the number of fields in the current record is stored in the built-in 3312variable @code{NF} (also @pxref{Built-in Variables}). The expression 3313@code{$NF} is not a special feature: it is the direct consequence of 3314evaluating @code{NF} and using its value as a field number. 3315 3316@node Changing Fields, Field Separators, Non-Constant Fields, Reading Files 3317@section Changing the Contents of a Field 3318 3319@cindex field, changing contents of 3320@cindex changing contents of a field 3321@cindex assignment to fields 3322You can change the contents of a field as seen by @code{awk} within an 3323@code{awk} program; this changes what @code{awk} perceives as the 3324current input record. (The actual input is untouched; @code{awk} @emph{never} 3325modifies the input file.) 3326 3327Consider this example and its output: 3328 3329@example 3330@group 3331$ awk '@{ $3 = $2 - 10; print $2, $3 @}' inventory-shipped 3332@print{} 13 3 3333@print{} 15 5 3334@print{} 15 5 3335@dots{} 3336@end group 3337@end example 3338 3339@noindent 3340The @samp{-} sign represents subtraction, so this program reassigns 3341field three, @code{$3}, to be the value of field two minus ten, 3342@samp{$2 - 10}. (@xref{Arithmetic Ops, ,Arithmetic Operators}.) 3343Then field two, and the new value for field three, are printed. 3344 3345In order for this to work, the text in field @code{$2} must make sense 3346as a number; the string of characters must be converted to a number in 3347order for the computer to do arithmetic on it. 
The number resulting 3348from the subtraction is converted back to a string of characters which 3349then becomes field three. 3350@xref{Conversion, ,Conversion of Strings and Numbers}. 3351 3352When you change the value of a field (as perceived by @code{awk}), the 3353text of the input record is recalculated to contain the new field where 3354the old one was. Therefore, @code{$0} changes to reflect the altered 3355field. Thus, this program 3356prints a copy of the input file, with 10 subtracted from the second 3357field of each line. 3358 3359@example 3360@group 3361$ awk '@{ $2 = $2 - 10; print $0 @}' inventory-shipped 3362@print{} Jan 3 25 15 115 3363@print{} Feb 5 32 24 226 3364@print{} Mar 5 24 34 228 3365@dots{} 3366@end group 3367@end example 3368 3369You can also assign contents to fields that are out of range. For 3370example: 3371 3372@example 3373$ awk '@{ $6 = ($5 + $4 + $3 + $2) 3374> print $6 @}' inventory-shipped 3375@print{} 168 3376@print{} 297 3377@print{} 301 3378@dots{} 3379@end example 3380 3381@noindent 3382We've just created @code{$6}, whose value is the sum of fields 3383@code{$2}, @code{$3}, @code{$4}, and @code{$5}. The @samp{+} sign 3384represents addition. For the file @file{inventory-shipped}, @code{$6} 3385represents the total number of parcels shipped for a particular month. 3386 3387Creating a new field changes @code{awk}'s internal copy of the current 3388input record---the value of @code{$0}. Thus, if you do @samp{print $0} 3389after adding a field, the record printed includes the new field, with 3390the appropriate number of field separators between it and the previously 3391existing fields. 3392 3393This recomputation affects and is affected by 3394@code{NF} (the number of fields; @pxref{Fields, ,Examining Fields}), 3395and by a feature that has not been discussed yet, 3396the @dfn{output field separator}, @code{OFS}, 3397which is used to separate the fields (@pxref{Output Separators}). 
For example, the value of @code{NF} is set to the number of the highest
field you create.

Note, however, that merely @emph{referencing} an out-of-range field
does @emph{not} change the value of either @code{$0} or @code{NF}.
Referencing an out-of-range field only produces an empty string.  For
example:

@example
if ($(NF+1) != "")
    print "can't happen"
else
    print "everything is normal"
@end example

@noindent
should print @samp{everything is normal}, because @code{NF+1} is certain
to be out of range.  (@xref{If Statement, ,The @code{if}-@code{else} Statement},
for more information about @code{awk}'s @code{if-else} statements.
@xref{Typing and Comparison, ,Variable Typing and Comparison Expressions},
for more information about the @samp{!=} operator.)

It is important to note that making an assignment to an existing field
will change the
value of @code{$0}, but will not change the value of @code{NF},
even when you assign the empty string to a field.  For example:

@example
@group
$ echo a b c d | awk '@{ OFS = ":"; $2 = ""
>                     print $0; print NF @}'
@print{} a::c:d
@print{} 4
@end group
@end example

@noindent
The field is still there; it just has an empty value.  You can tell
because there are two colons in a row.

This example shows what happens if you create a new field.

@example
$ echo a b c d | awk '@{ OFS = ":"; $2 = ""; $6 = "new"
>                     print $0; print NF @}'
@print{} a::c:d::new
@print{} 6
@end example

@noindent
The intervening field, @code{$5}, is created with an empty value
(indicated by the second pair of adjacent colons),
and @code{NF} is updated with the value six.

Finally, decrementing @code{NF} will lose the values of the fields
after the new value of @code{NF}, and @code{$0} will be recomputed.
Here is an example:

@example
$ echo a b c d e f | gawk '@{ print "NF =", NF;
>                            NF = 3; print $0 @}'
@print{} NF = 6
@print{} a b c
@end example

@node Field Separators, Constant Size, Changing Fields, Reading Files
@section Specifying How Fields are Separated

This section is rather long; it describes one of the most fundamental
operations in @code{awk}.

@menu
* Basic Field Splitting::        How fields are split with single characters
                                 or simple strings.
* Regexp Field Splitting::       Using regexps as the field separator.
* Single Character Fields::      Making each character a separate field.
* Command Line Field Separator:: Setting @code{FS} from the command line.
* Field Splitting Summary::      Some final points and a summary table.
@end menu

@node Basic Field Splitting, Regexp Field Splitting, Field Separators, Field Separators
@subsection The Basics of Field Separating
@vindex FS
@cindex fields, separating
@cindex field separator, @code{FS}

The @dfn{field separator}, which is either a single character or a regular
expression, controls the way @code{awk} splits an input record into fields.
@code{awk} scans the input record for character sequences that
match the separator; the fields themselves are the text between the matches.

In the examples below, we use the bullet symbol ``@bullet{}'' to represent
spaces in the output.

If the field separator is @samp{oo}, then the following line:

@example
moo goo gai pan
@end example

@noindent
would be split into three fields: @samp{m}, @samp{@bullet{}g} and
@samp{@bullet{}gai@bullet{}pan}.
Note the leading spaces in the values of the second and third fields.

@cindex common mistakes
@cindex mistakes, common
@cindex errors, common
The field separator is represented by the built-in variable @code{FS}.
3507Shell programmers take note! @code{awk} does @emph{not} use the name @code{IFS} 3508which is used by the POSIX compatible shells (such as the Bourne shell, 3509@code{sh}, or the GNU Bourne-Again Shell, Bash). 3510 3511You can change the value of @code{FS} in the @code{awk} program with the 3512assignment operator, @samp{=} (@pxref{Assignment Ops, ,Assignment Expressions}). 3513Often the right time to do this is at the beginning of execution, 3514before any input has been processed, so that the very first record 3515will be read with the proper separator. To do this, use the special 3516@code{BEGIN} pattern 3517(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}). 3518For example, here we set the value of @code{FS} to the string 3519@code{","}: 3520 3521@example 3522awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}' 3523@end example 3524 3525@noindent 3526Given the input line, 3527 3528@example 3529John Q. Smith, 29 Oak St., Walamazoo, MI 42139 3530@end example 3531 3532@noindent 3533this @code{awk} program extracts and prints the string 3534@samp{@bullet{}29@bullet{}Oak@bullet{}St.}. 3535 3536@cindex field separator, choice of 3537@cindex regular expressions as field separators 3538Sometimes your input data will contain separator characters that don't 3539separate fields the way you thought they would. For instance, the 3540person's name in the example we just used might have a title or 3541suffix attached, such as @samp{John Q. Smith, LXIX}. From input 3542containing such a name: 3543 3544@example 3545John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139 3546@end example 3547 3548@noindent 3549@c careful of an overfull hbox here! 3550the above program would extract @samp{@bullet{}LXIX}, instead of 3551@samp{@bullet{}29@bullet{}Oak@bullet{}St.}. 3552If you were expecting the program to print the 3553address, you would be surprised. The moral is: choose your data layout and 3554separator characters carefully to prevent such problems. 
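
One way to notice this kind of surprise early is to look at the field count
along with the field itself.  The following is only a sketch under that
assumption, not a technique prescribed by this manual; the shell variables
@code{good}, @code{bad}, @code{prog}, @code{out_good} and @code{out_bad}
are illustrative names.

```shell
# Sketch: print the number of fields next to the second field, so an
# unexpected extra comma in the data shows up as a different count.
good='John Q. Smith, 29 Oak St., Walamazoo, MI 42139'
bad='John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139'
prog='BEGIN { FS = "," } { print NF ":" $2 }'

out_good=$(printf '%s\n' "$good" | awk "$prog")
out_bad=$(printf '%s\n' "$bad" | awk "$prog")

echo "$out_good"
echo "$out_bad"
```

The first record has four comma-separated fields and the second has five, so
a program that expects the address in @code{$2} could test @code{NF} before
trusting it.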
3555 3556@iftex 3557As you know, normally, 3558@end iftex 3559@ifinfo 3560Normally, 3561@end ifinfo 3562fields are separated by whitespace sequences 3563(spaces, tabs and newlines), not by single spaces: two spaces in a row do not 3564delimit an empty field. The default value of the field separator @code{FS} 3565is a string containing a single space, @w{@code{" "}}. If this value were 3566interpreted in the usual way, each space character would separate 3567fields, so two spaces in a row would make an empty field between them. 3568The reason this does not happen is that a single space as the value of 3569@code{FS} is a special case: it is taken to specify the default manner 3570of delimiting fields. 3571 3572If @code{FS} is any other single character, such as @code{","}, then 3573each occurrence of that character separates two fields. Two consecutive 3574occurrences delimit an empty field. If the character occurs at the 3575beginning or the end of the line, that too delimits an empty field. The 3576space character is the only single character which does not follow these 3577rules. 3578 3579@node Regexp Field Splitting, Single Character Fields, Basic Field Splitting, Field Separators 3580@subsection Using Regular Expressions to Separate Fields 3581 3582The previous 3583@iftex 3584subsection 3585@end iftex 3586@ifinfo 3587node 3588@end ifinfo 3589discussed the use of single characters or simple strings as the 3590value of @code{FS}. 3591More generally, the value of @code{FS} may be a string containing any 3592regular expression. In this case, each match in the record for the regular 3593expression separates fields. For example, the assignment: 3594 3595@example 3596FS = ", \t" 3597@end example 3598 3599@noindent 3600makes every area of an input line that consists of a comma followed by a 3601space and a tab, into a field separator. 
(@samp{\t}
is an @dfn{escape sequence} that stands for a tab;
@pxref{Escape Sequences},
for the complete list of similar escape sequences.)

For a less trivial example of a regular expression, suppose you want
single spaces to separate fields the way single commas were used above.
You can set @code{FS} to @w{@code{"[@ ]"}} (left bracket, space, right
bracket).  This regular expression matches a single space and nothing else
(@pxref{Regexp, ,Regular Expressions}).

There is an important difference between the two cases of @samp{FS = @w{" "}}
(a single space) and @samp{FS = @w{"[ \t\n]+"}} (left bracket, space,
backslash, ``t'', backslash, ``n'', right bracket, plus sign, which is a
regular expression matching one or more spaces, tabs, or newlines).  For both
values of @code{FS}, fields are separated by runs of spaces, tabs
and/or newlines.  However, when the value of @code{FS} is @w{@code{" "}},
@code{awk} will first strip leading and trailing whitespace from
the record, and then decide where the fields are.

For example, the following pipeline prints @samp{b}:

@example
@group
$ echo ' a b c d ' | awk '@{ print $2 @}'
@print{} b
@end group
@end example

@noindent
However, this pipeline prints @samp{a} (note the extra spaces around
each letter):

@example
$ echo ' a  b  c  d ' | awk 'BEGIN @{ FS = "[ \t]+" @}
>                                  @{ print $2 @}'
@print{} a
@end example

@noindent
@cindex null string
@cindex empty string
In this case, the first field is @dfn{null}, or empty.

The stripping of leading and trailing whitespace also comes into
play whenever @code{$0} is recomputed.
For instance, study this pipeline:

@example
$ echo '   a b c d' | awk '@{ print; $2 = $2; print @}'
@print{}    a b c d
@print{} a b c d
@end example

@noindent
The first @code{print} statement prints the record as it was read,
with leading whitespace intact.  The assignment to @code{$2} rebuilds
@code{$0} by concatenating @code{$1} through @code{$NF} together,
separated by the value of @code{OFS}.  Since the leading whitespace
was ignored when finding @code{$1}, it is not part of the new @code{$0}.
Finally, the last @code{print} statement prints the new @code{$0}.

@node Single Character Fields, Command Line Field Separator, Regexp Field Splitting, Field Separators
@subsection Making Each Character a Separate Field

@cindex differences between @code{gawk} and @code{awk}
@cindex single character fields
There are times when you may want to examine each character
of a record separately.  In @code{gawk}, this is easy to do: you
simply assign the null string (@code{""}) to @code{FS}.  In this case,
each individual character in the record will become a separate field.
Here is an example:

@example
@group
$ echo a b | gawk 'BEGIN @{ FS = "" @}
>                  @{
>                      for (i = 1; i <= NF; i = i + 1)
>                          print "Field", i, "is", $i
>                  @}'
@print{} Field 1 is a
@print{} Field 2 is
@print{} Field 3 is b
@end group
@end example

@cindex dark corner
Traditionally, the behavior for @code{FS} equal to @code{""} was not defined.
In this case, Unix @code{awk} would simply treat the entire record
as only having one field (d.c.).  In compatibility mode
(@pxref{Options, ,Command Line Options}),
if @code{FS} is the null string, then @code{gawk} will also
behave this way.
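
Because the null-string @code{FS} is specific to @code{gawk}, a program that
must run under other @code{awk} implementations needs a different approach.
One possibility, shown here only as a sketch, is to walk the record with the
standard @code{length} and @code{substr} functions instead of relying on
field splitting.  The brackets in the output are added just to make the
space character visible; @code{out} is an illustrative shell variable.

```shell
# Sketch: examine each character of a record without FS = "".
# substr($0, i, 1) extracts the i-th character of the record.
out=$(printf 'a b\n' | awk '{
    n = length($0)
    for (i = 1; i <= n; i = i + 1)
        print "Char", i, "is", "[" substr($0, i, 1) "]"
}')
echo "$out"
```

This does not create fields, so @code{NF} and @code{$i} keep their usual
meanings; only the loop sees the individual characters.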
3693 3694@node Command Line Field Separator, Field Splitting Summary, Single Character Fields, Field Separators 3695@subsection Setting @code{FS} from the Command Line 3696@cindex @code{-F} option 3697@cindex field separator, on command line 3698@cindex command line, setting @code{FS} on 3699 3700@code{FS} can be set on the command line. You use the @samp{-F} option to 3701do so. For example: 3702 3703@example 3704awk -F, '@var{program}' @var{input-files} 3705@end example 3706 3707@noindent 3708sets @code{FS} to be the @samp{,} character. Notice that the option uses 3709a capital @samp{F}. Contrast this with @samp{-f}, which specifies a file 3710containing an @code{awk} program. Case is significant in command line options: 3711the @samp{-F} and @samp{-f} options have nothing to do with each other. 3712You can use both options at the same time to set the @code{FS} variable 3713@emph{and} get an @code{awk} program from a file. 3714 3715The value used for the argument to @samp{-F} is processed in exactly the 3716same way as assignments to the built-in variable @code{FS}. This means that 3717if the field separator contains special characters, they must be escaped 3718appropriately. For example, to use a @samp{\} as the field separator, you 3719would have to type: 3720 3721@example 3722# same as FS = "\\" 3723awk -F\\\\ '@dots{}' files @dots{} 3724@end example 3725 3726@noindent 3727Since @samp{\} is used for quoting in the shell, @code{awk} will see 3728@samp{-F\\}. Then @code{awk} processes the @samp{\\} for escape 3729characters (@pxref{Escape Sequences}), finally yielding 3730a single @samp{\} to be used for the field separator. 3731 3732@cindex historical features 3733As a special case, in compatibility mode 3734(@pxref{Options, ,Command Line Options}), if the 3735argument to @samp{-F} is @samp{t}, then @code{FS} is set to the tab 3736character. 
This is because if you type @samp{-F\t} at the shell, 3737without any quotes, the @samp{\} gets deleted, so @code{awk} figures that you 3738really want your fields to be separated with tabs, and not @samp{t}s. 3739Use @samp{-v FS="t"} on the command line if you really do want to separate 3740your fields with @samp{t}s 3741(@pxref{Options, ,Command Line Options}). 3742 3743For example, let's use an @code{awk} program file called @file{baud.awk} 3744that contains the pattern @code{/300/}, and the action @samp{print $1}. 3745Here is the program: 3746 3747@example 3748/300/ @{ print $1 @} 3749@end example 3750 3751Let's also set @code{FS} to be the @samp{-} character, and run the 3752program on the file @file{BBS-list}. The following command prints a 3753list of the names of the bulletin boards that operate at 300 baud and 3754the first three digits of their phone numbers: 3755 3756@c tweaked to make the tex output look better in @smallbook 3757@example 3758@group 3759$ awk -F- -f baud.awk BBS-list 3760@print{} aardvark 555 3761@print{} alpo 3762@print{} barfly 555 3763@dots{} 3764@end group 3765@ignore 3766@print{} bites 555 3767@print{} camelot 555 3768@print{} core 555 3769@print{} fooey 555 3770@print{} foot 555 3771@print{} macfoo 555 3772@print{} sdace 555 3773@print{} sabafoo 555 3774@end ignore 3775@end example 3776 3777@noindent 3778Note the second line of output. In the original file 3779(@pxref{Sample Data Files, ,Data Files for the Examples}), 3780the second line looked like this: 3781 3782@example 3783alpo-net 555-3412 2400/1200/300 A 3784@end example 3785 3786The @samp{-} as part of the system's name was used as the field 3787separator, instead of the @samp{-} in the phone number that was 3788originally intended. This demonstrates why you have to be careful in 3789choosing your field and record separators. 3790 3791On many Unix systems, each user has a separate entry in the system password 3792file, one line per user. 
The information in these lines is separated 3793by colons. The first field is the user's logon name, and the second is 3794the user's encrypted password. A password file entry might look like this: 3795 3796@example 3797arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh 3798@end example 3799 3800The following program searches the system password file, and prints 3801the entries for users who have no password: 3802 3803@example 3804awk -F: '$2 == ""' /etc/passwd 3805@end example 3806 3807@node Field Splitting Summary, , Command Line Field Separator, Field Separators 3808@subsection Field Splitting Summary 3809 3810@cindex @code{awk} language, POSIX version 3811@cindex POSIX @code{awk} 3812According to the POSIX standard, @code{awk} is supposed to behave 3813as if each record is split into fields at the time that it is read. 3814In particular, this means that you can change the value of @code{FS} 3815after a record is read, and the value of the fields (i.e.@: how they were split) 3816should reflect the old value of @code{FS}, not the new one. 3817 3818@cindex dark corner 3819@cindex @code{sed} utility 3820@cindex stream editor 3821However, many implementations of @code{awk} do not work this way. Instead, 3822they defer splitting the fields until a field is actually 3823referenced. The fields will be split 3824using the @emph{current} value of @code{FS}! (d.c.) 3825This behavior can be difficult 3826to diagnose. The following example illustrates the difference 3827between the two methods. 3828(The @code{sed}@footnote{The @code{sed} utility is a ``stream editor.'' 3829Its behavior is also defined by the POSIX standard.} 3830command prints just the first line of @file{/etc/passwd}.) 
3831 3832@example 3833sed 1q /etc/passwd | awk '@{ FS = ":" ; print $1 @}' 3834@end example 3835 3836@noindent 3837will usually print 3838 3839@example 3840root 3841@end example 3842 3843@noindent 3844on an incorrect implementation of @code{awk}, while @code{gawk} 3845will print something like 3846 3847@example 3848root:nSijPlPhZZwgE:0:0:Root:/: 3849@end example 3850 3851The following table summarizes how fields are split, based on the 3852value of @code{FS}. (@samp{==} means ``is equal to.'') 3853 3854@c @cartouche 3855@table @code 3856@item FS == " " 3857Fields are separated by runs of whitespace. Leading and trailing 3858whitespace are ignored. This is the default. 3859 3860@item FS == @var{any other single character} 3861Fields are separated by each occurrence of the character. Multiple 3862successive occurrences delimit empty fields, as do leading and 3863trailing occurrences. 3864The character can even be a regexp metacharacter; it does not need 3865to be escaped. 3866 3867@item FS == @var{regexp} 3868Fields are separated by occurrences of characters that match @var{regexp}. 3869Leading and trailing matches of @var{regexp} delimit empty fields. 3870 3871@item FS == "" 3872Each individual character in the record becomes a separate field. 3873@end table 3874@c @end cartouche 3875 3876@node Constant Size, Multiple Line, Field Separators, Reading Files 3877@section Reading Fixed-width Data 3878 3879(This section discusses an advanced, experimental feature. If you are 3880a novice @code{awk} user, you may wish to skip it on the first reading.) 3881 3882@code{gawk} version 2.13 introduced a new facility for dealing with 3883fixed-width fields with no distinctive field separator. Data of this 3884nature arises, for example, in the input for old FORTRAN programs where 3885numbers are run together; or in the output of programs that did not 3886anticipate the use of their output as input for other programs. 

An example of the latter is a table where all the columns are lined up by
the use of a variable number of spaces and @emph{empty fields are just
spaces}.  Clearly, @code{awk}'s normal field splitting based on @code{FS}
will not work well in this case.  Although a portable @code{awk} program
can use a series of @code{substr} calls on @code{$0}
(@pxref{String Functions, ,Built-in Functions for String Manipulation}),
this is awkward and inefficient for a large number of fields.

The splitting of an input record into fixed-width fields is specified by
assigning a string containing space-separated numbers to the built-in
variable @code{FIELDWIDTHS}.  Each number specifies the width of the field
@emph{including} columns between fields.  If you want to ignore the columns
between fields, you can specify the width as a separate field that is
subsequently ignored.

The following data is the output of the Unix @code{w} utility.  It is useful
to illustrate the use of @code{FIELDWIDTHS}.

@example
@group
 10:06pm  up 21 days, 14:04,  23 users
User     tty       login@  idle   JCPU   PCPU  what
hzuo     ttyV0     8:58pm            9      5  vi p24.tex
hzang    ttyV3     6:37pm    50                -csh
eklye    ttyV5     9:53pm            7      1  em thes.tex
dportein ttyV6     8:17pm  1:47                -csh
gierd    ttyD3    10:00pm     1                elm
dave     ttyD4     9:47pm            4      4  w
brent    ttyp0    26Jun91  4:46  26:46   4:41  bash
dave     ttyq4    26Jun9115days     46     46  wnewmail
@end group
@end example

The following program takes the above input, converts the idle time to
number of seconds and prints out the first two fields and the calculated
idle time.  (This program uses a number of @code{awk} features that
haven't been introduced yet.)
3925 3926@example 3927BEGIN @{ FIELDWIDTHS = "9 6 10 6 7 7 35" @} 3928NR > 2 @{ 3929 idle = $4 3930 sub(/^ */, "", idle) # strip leading spaces 3931 if (idle == "") 3932 idle = 0 3933@group 3934 if (idle ~ /:/) @{ 3935 split(idle, t, ":") 3936 idle = t[1] * 60 + t[2] 3937 @} 3938@end group 3939@group 3940 if (idle ~ /days/) 3941 idle *= 24 * 60 * 60 3942 3943 print $1, $2, idle 3944@} 3945@end group 3946@end example 3947 3948Here is the result of running the program on the data: 3949 3950@example 3951hzuo ttyV0 0 3952hzang ttyV3 50 3953eklye ttyV5 0 3954dportein ttyV6 107 3955gierd ttyD3 1 3956dave ttyD4 0 3957brent ttyp0 286 3958dave ttyq4 1296000 3959@end example 3960 3961Another (possibly more practical) example of fixed-width input data 3962would be the input from a deck of balloting cards. In some parts of 3963the United States, voters mark their choices by punching holes in computer 3964cards. These cards are then processed to count the votes for any particular 3965candidate or on any particular issue. Since a voter may choose not to 3966vote on some issue, any column on the card may be empty. An @code{awk} 3967program for processing such data could use the @code{FIELDWIDTHS} feature 3968to simplify reading the data. (Of course, getting @code{gawk} to run on 3969a system with card readers is another story!) 3970 3971@ignore 3972Exercise: Write a ballot card reading program 3973@end ignore 3974 3975Assigning a value to @code{FS} causes @code{gawk} to return to using 3976@code{FS} for field splitting. Use @samp{FS = FS} to make this happen, 3977without having to know the current value of @code{FS}. 3978 3979This feature is still experimental, and may evolve over time. 3980Note that in particular, @code{gawk} does not attempt to verify 3981the sanity of the values used in the value of @code{FIELDWIDTHS}. 
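
For comparison, the portable approach mentioned earlier in this section, a series of @code{substr} calls on @code{$0}, can be sketched as follows. This is only an illustration and not part of the original program: the shell function name @code{fixed_cols} and the sample input line are made up, and the column offsets (1, 10, and 26) are derived from the widths @samp{9 6 10 6} used in the @code{FIELDWIDTHS} example above.

```shell
# Hypothetical helper: pull the user, tty, and idle columns out of
# fixed-width input with substr, without relying on FIELDWIDTHS.
fixed_cols() {
    awk '{
        user = substr($0, 1, 9)     # columns 1-9
        tty  = substr($0, 10, 6)    # columns 10-15
        idle = substr($0, 26, 6)    # columns 26-31
        gsub(/ /, "", user)         # remove the padding spaces
        gsub(/ /, "", tty)
        gsub(/ /, "", idle)
        if (idle == "")             # an empty column counts as zero
            idle = 0
        print user, tty, idle
    }'
}

# Assumed sample line, padded to the widths 9, 6, 10, and 6.
printf '%-9s%-6s%-10s%6s\n' hzang ttyV3 6:37pm 50 | fixed_cols
```

Every additional field needs another @code{substr} call with a hand-computed offset, which is the bookkeeping that @code{FIELDWIDTHS} takes over.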
3982 3983@node Multiple Line, Getline, Constant Size, Reading Files 3984@section Multiple-Line Records 3985 3986@cindex multiple line records 3987@cindex input, multiple line records 3988@cindex reading files, multiple line records 3989@cindex records, multiple line 3990In some data bases, a single line cannot conveniently hold all the 3991information in one entry. In such cases, you can use multi-line 3992records. 3993 3994The first step in doing this is to choose your data format: when records 3995are not defined as single lines, how do you want to define them? 3996What should separate records? 3997 3998One technique is to use an unusual character or string to separate 3999records. For example, you could use the formfeed character (written 4000@samp{\f} in @code{awk}, as in C) to separate them, making each record 4001a page of the file. To do this, just set the variable @code{RS} to 4002@code{"\f"} (a string containing the formfeed character). Any 4003other character could equally well be used, as long as it won't be part 4004of the data in a record. 4005 4006Another technique is to have blank lines separate records. By a special 4007dispensation, an empty string as the value of @code{RS} indicates that 4008records are separated by one or more blank lines. If you set @code{RS} 4009to the empty string, a record always ends at the first blank line 4010encountered. And the next record doesn't start until the first non-blank 4011line that follows---no matter how many blank lines appear in a row, they 4012are considered one record-separator. 4013 4014@cindex leftmost longest match 4015@cindex matching, leftmost longest 4016You can achieve the same effect as @samp{RS = ""} by assigning the 4017string @code{"\n\n+"} to @code{RS}. This regexp matches the newline 4018at the end of the record, and one or more blank lines after the record. 
4019In addition, a regular expression always matches the longest possible 4020sequence when there is a choice 4021(@pxref{Leftmost Longest, ,How Much Text Matches?}). 4022So the next record doesn't start until 4023the first non-blank line that follows---no matter how many blank lines 4024appear in a row, they are considered one record-separator. 4025 4026@cindex dark corner 4027There is an important difference between @samp{RS = ""} and 4028@samp{RS = "\n\n+"}. In the first case, leading newlines in the input 4029data file are ignored, and if a file ends without extra blank lines 4030after the last record, the final newline is removed from the record. 4031In the second case, this special processing is not done (d.c.). 4032 4033Now that the input is separated into records, the second step is to 4034separate the fields in the record. One way to do this is to divide each 4035of the lines into fields in the normal manner. This happens by default 4036as the result of a special feature: when @code{RS} is set to the empty 4037string, the newline character @emph{always} acts as a field separator. 4038This is in addition to whatever field separations result from @code{FS}. 4039 4040The original motivation for this special exception was probably to provide 4041useful behavior in the default case (i.e.@: @code{FS} is equal 4042to @w{@code{" "}}). This feature can be a problem if you really don't 4043want the newline character to separate fields, since there is no way to 4044prevent it. However, you can work around this by using the @code{split} 4045function to break up the record manually 4046(@pxref{String Functions, ,Built-in Functions for String Manipulation}). 4047 4048Another way to separate fields is to 4049put each field on a separate line: to do this, just set the 4050variable @code{FS} to the string @code{"\n"}. (This simple regular 4051expression matches a single newline.) 
4052 4053A practical example of a data file organized this way might be a mailing 4054list, where each entry is separated by blank lines. If we have a mailing 4055list in a file named @file{addresses}, that looks like this: 4056 4057@c NEEDED 4058@page 4059@example 4060Jane Doe 4061123 Main Street 4062Anywhere, SE 12345-6789 4063 4064John Smith 4065456 Tree-lined Avenue 4066Smallville, MW 98765-4321 4067@dots{} 4068@end example 4069 4070@noindent 4071A simple program to process this file would look like this: 4072 4073@example 4074@group 4075# addrs.awk --- simple mailing list program 4076 4077# Records are separated by blank lines. 4078# Each line is one field. 4079BEGIN @{ RS = "" ; FS = "\n" @} 4080 4081@{ 4082 print "Name is:", $1 4083 print "Address is:", $2 4084 print "City and State are:", $3 4085 print "" 4086@} 4087@end group 4088@end example 4089 4090Running the program produces the following output: 4091 4092@example 4093@group 4094$ awk -f addrs.awk addresses 4095@print{} Name is: Jane Doe 4096@print{} Address is: 123 Main Street 4097@print{} City and State are: Anywhere, SE 12345-6789 4098@print{} 4099@end group 4100@group 4101@print{} Name is: John Smith 4102@print{} Address is: 456 Tree-lined Avenue 4103@print{} City and State are: Smallville, MW 98765-4321 4104@print{} 4105@dots{} 4106@end group 4107@end example 4108 4109@xref{Labels Program, ,Printing Mailing Labels}, for a more realistic 4110program that deals with address lists. 4111 4112The following table summarizes how records are split, based on the 4113value of @code{RS}. (@samp{==} means ``is equal to.'') 4114 4115@c @cartouche 4116@table @code 4117@item RS == "\n" 4118Records are separated by the newline character (@samp{\n}). In effect, 4119every line in the data file is a separate record, including blank lines. 4120This is the default. 4121 4122@item RS == @var{any single character} 4123Records are separated by each occurrence of the character. 
Multiple 4124successive occurrences delimit empty records. 4125 4126@item RS == "" 4127Records are separated by runs of blank lines. The newline character 4128always serves as a field separator, in addition to whatever value 4129@code{FS} may have. Leading and trailing newlines in a file are ignored. 4130 4131@item RS == @var{regexp} 4132Records are separated by occurrences of characters that match @var{regexp}. 4133Leading and trailing matches of @var{regexp} delimit empty records. 4134@end table 4135@c @end cartouche 4136 4137@vindex RT 4138In all cases, @code{gawk} sets @code{RT} to the input text that matched the 4139value specified by @code{RS}. 4140 4141@node Getline, , Multiple Line, Reading Files 4142@section Explicit Input with @code{getline} 4143 4144@findex getline 4145@cindex input, explicit 4146@cindex explicit input 4147@cindex input, @code{getline} command 4148@cindex reading files, @code{getline} command 4149So far we have been getting our input data from @code{awk}'s main 4150input stream---either the standard input (usually your terminal, sometimes 4151the output from another program) or from the 4152files specified on the command line. The @code{awk} language has a 4153special built-in command called @code{getline} that 4154can be used to read input under your explicit control. 4155 4156@menu 4157* Getline Intro:: Introduction to the @code{getline} function. 4158* Plain Getline:: Using @code{getline} with no arguments. 4159* Getline/Variable:: Using @code{getline} into a variable. 4160* Getline/File:: Using @code{getline} from a file. 4161* Getline/Variable/File:: Using @code{getline} into a variable from a 4162 file. 4163* Getline/Pipe:: Using @code{getline} from a pipe. 4164* Getline/Variable/Pipe:: Using @code{getline} into a variable from a 4165 pipe. 4166* Getline Summary:: Summary Of @code{getline} Variants. 
4167@end menu 4168 4169@node Getline Intro, Plain Getline, Getline, Getline 4170@subsection Introduction to @code{getline} 4171 4172This command is used in several different ways, and should @emph{not} be 4173used by beginners. It is covered here because this is the chapter on input. 4174The examples that follow the explanation of the @code{getline} command 4175include material that has not been covered yet. Therefore, come back 4176and study the @code{getline} command @emph{after} you have reviewed the 4177rest of this @value{DOCUMENT} and have a good knowledge of how @code{awk} works. 4178 4179@vindex ERRNO 4180@cindex differences between @code{gawk} and @code{awk} 4181@cindex @code{getline}, return values 4182@code{getline} returns one if it finds a record, and zero if the end of the 4183file is encountered. If there is some error in getting a record, such 4184as a file that cannot be opened, then @code{getline} returns @minus{}1. 4185In this case, @code{gawk} sets the variable @code{ERRNO} to a string 4186describing the error that occurred. 4187 4188In the following examples, @var{command} stands for a string value that 4189represents a shell command. 4190 4191@node Plain Getline, Getline/Variable, Getline Intro, Getline 4192@subsection Using @code{getline} with No Arguments 4193 4194The @code{getline} command can be used without arguments to read input 4195from the current input file. All it does in this case is read the next 4196input record and split it up into fields. This is useful if you've 4197finished processing the current record, but you want to do some special 4198processing @emph{right now} on the next record. 
Here's an 4199example: 4200 4201@example 4202@group 4203awk '@{ 4204 if ((t = index($0, "/*")) != 0) @{ 4205 # value will be "" if t is 1 4206 tmp = substr($0, 1, t - 1) 4207 u = index(substr($0, t + 2), "*/") 4208 while (u == 0) @{ 4209 if (getline <= 0) @{ 4210 m = "unexpected EOF or error" 4211 m = (m ": " ERRNO) 4212 print m > "/dev/stderr" 4213 exit 4214 @} 4215 t = -1 4216 u = index($0, "*/") 4217 @} 4218@end group 4219@group 4220 # substr expression will be "" if */ 4221 # occurred at end of line 4222 $0 = tmp substr($0, t + u + 3) 4223 @} 4224 print $0 4225@}' 4226@end group 4227@end example 4228 4229This @code{awk} program deletes all C-style comments, @samp{/* @dots{} 4230*/}, from the input. By replacing the @samp{print $0} with other 4231statements, you could perform more complicated processing on the 4232decommented input, like searching for matches of a regular 4233expression. This program has a subtle problem---it does not work if one 4234comment ends and another begins on the same line. 4235 4236@ignore 4237Exercise, 4238write a program that does handle multiple comments on the line. 4239@end ignore 4240 4241This form of the @code{getline} command sets @code{NF} (the number of 4242fields; @pxref{Fields, ,Examining Fields}), @code{NR} (the number of 4243records read so far; @pxref{Records, ,How Input is Split into Records}), 4244@code{FNR} (the number of records read from this input file), and the 4245value of @code{$0}. 4246 4247@cindex dark corner 4248@strong{Note:} the new value of @code{$0} is used in testing 4249the patterns of any subsequent rules. The original value 4250of @code{$0} that triggered the rule which executed @code{getline} 4251is lost (d.c.). 4252By contrast, the @code{next} statement reads a new record 4253but immediately begins processing it normally, starting with the first 4254rule in the program. @xref{Next Statement, ,The @code{next} Statement}. 
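
The effect described in the note can be seen with a small experiment. The following sketch is an illustration only; the function name @code{getline_demo} and the three-line input are assumptions. The first rule calls @code{getline} when it sees a record containing @samp{a}; the second rule then runs against the record that @code{getline} read, so the original record is never printed.

```shell
# Hypothetical demonstration: the record that triggered getline is lost.
getline_demo() {
    awk '
    /a/ { getline }            # replaces $0 (and NR) with the next record
        { print NR ": " $0 }   # sees whatever record is current now
    '
}

printf 'a\nb\nc\n' | getline_demo
```

With this input the program prints @samp{2: b} and then @samp{3: c}; the line @samp{a} never reaches the second rule.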
4255 4256@node Getline/Variable, Getline/File, Plain Getline, Getline 4257@subsection Using @code{getline} Into a Variable 4258 4259You can use @samp{getline @var{var}} to read the next record from 4260@code{awk}'s input into the variable @var{var}. No other processing is 4261done. 4262 4263For example, suppose the next line is a comment, or a special string, 4264and you want to read it, without triggering 4265any rules. This form of @code{getline} allows you to read that line 4266and store it in a variable so that the main 4267read-a-line-and-check-each-rule loop of @code{awk} never sees it. 4268 4269The following example swaps every two lines of input. For example, given: 4270 4271@example 4272wan 4273tew 4274free 4275phore 4276@end example 4277 4278@noindent 4279it outputs: 4280 4281@example 4282tew 4283wan 4284phore 4285free 4286@end example 4287 4288@noindent 4289Here's the program: 4290 4291@example 4292@group 4293awk '@{ 4294 if ((getline tmp) > 0) @{ 4295 print tmp 4296 print $0 4297 @} else 4298 print $0 4299@}' 4300@end group 4301@end example 4302 4303The @code{getline} command used in this way sets only the variables 4304@code{NR} and @code{FNR} (and of course, @var{var}). The record is not 4305split into fields, so the values of the fields (including @code{$0}) and 4306the value of @code{NF} do not change. 4307 4308@node Getline/File, Getline/Variable/File, Getline/Variable, Getline 4309@subsection Using @code{getline} from a File 4310 4311@cindex input redirection 4312@cindex redirection of input 4313Use @samp{getline < @var{file}} to read 4314the next record from the file 4315@var{file}. Here @var{file} is a string-valued expression that 4316specifies the file name. @samp{< @var{file}} is called a @dfn{redirection} 4317since it directs input to come from a different place. 

For example, the following
program reads a record from the file @file{secondary.input} when it
encounters a first field with a value equal to 10 in the current input
file.

@example
@group
awk '@{
     if ($1 == 10) @{
          getline < "secondary.input"
          print
     @} else
          print
@}'
@end group
@end example

Since the main input stream is not used, the values of @code{NR} and
@code{FNR} are not changed. But the record read is split into fields in
the normal manner, so the values of @code{$0} and other fields are
changed. So is the value of @code{NF}.

@c Thanks to Paul Eggert for initial wording here
According to POSIX, @samp{getline < @var{expression}} is ambiguous if
@var{expression} contains unparenthesized operators other than
@samp{$}; for example, @samp{getline < dir "/" file} is ambiguous
because the concatenation operator is not parenthesized, and you should
write it as @samp{getline < (dir "/" file)} if you want your program
to be portable to other @code{awk} implementations.

@node Getline/Variable/File, Getline/Pipe, Getline/File, Getline
@subsection Using @code{getline} Into a Variable from a File

Use @samp{getline @var{var} < @var{file}} to read input from
the file
@var{file} and put it in the variable @var{var}. As above, @var{file}
is a string-valued expression that specifies the file from which to read.

In this version of @code{getline}, none of the built-in variables are
changed, and the record is not split into fields. The only variable
changed is @var{var}.
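
A quick way to check this claim is to print @code{NF} and @code{$0} after the call. The sketch below is an illustration under stated assumptions: the function name @code{getline_var_demo}, the variable name @code{extra}, and the temporary file (created with the @code{mktemp} utility) are all made up for the demonstration.

```shell
# Hypothetical demonstration: "getline var < file" changes only var.
getline_var_demo() {
    tmp=$(mktemp) || return 1
    printf 'one two three\n' > "$tmp"
    echo 'main record' | awk -v extra="$tmp" '{
        if ((getline line < extra) > 0)
            print NF, $0, "|", line   # NF and $0 still describe the main record
        close(extra)
    }'
    rm -f "$tmp"
}

getline_var_demo
```

The output is @samp{2 main record | one two three}: the three-word line ends up in @code{line}, while @code{NF} and @code{$0} still belong to the record read from the main input.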

@ifinfo
@c Thanks to Paul Eggert for initial wording here
According to POSIX, @samp{getline @var{var} < @var{expression}} is ambiguous if
@var{expression} contains unparenthesized operators other than
@samp{$}; for example, @samp{getline @var{var} < dir "/" file} is ambiguous
because the concatenation operator is not parenthesized, and you should
write it as @samp{getline @var{var} < (dir "/" file)} if you want your program
to be portable to other @code{awk} implementations.
@end ifinfo

For example, the following program copies all the input files to the
output, except for records that say @w{@samp{@@include @var{filename}}}.
Such a record is replaced by the contents of the file
@var{filename}.

@example
@group
awk '@{
     if (NF == 2 && $1 == "@@include") @{
          while ((getline line < $2) > 0)
               print line
          close($2)
     @} else
          print
@}'
@end group
@end example

Note here how the name of the extra input file is not built into
the program; it is taken directly from the data, from the second field on
the @samp{@@include} line.

The @code{close} function is called to ensure that if two identical
@samp{@@include} lines appear in the input, the entire specified file is
included twice.
@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}.

One deficiency of this program is that it does not process nested
@samp{@@include} statements
(@samp{@@include} statements in included files)
the way a true macro preprocessor would.
@xref{Igawk Program, ,An Easy Way to Use Library Functions}, for a program
that does handle nested @samp{@@include} statements.
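
As a rough idea of what handling nested includes involves, the program above can be restructured around a recursive user-defined function. This is a minimal sketch and not the approach taken by the program referenced above: the names @code{expand_includes} and @code{expand_file} are hypothetical, user-defined functions have not been introduced yet, and the sketch makes no attempt to detect a file that includes itself, which would recurse without end.

```shell
# Hypothetical sketch: expand @include lines, including nested ones.
expand_includes() {
    awk '
    # Copy FILE to the output, expanding @include lines inside it.
    # The extra parameters (line, parts) serve as local variables.
    function expand_file(file,    line, parts) {
        while ((getline line < file) > 0) {
            if (split(line, parts, " ") == 2 && parts[1] == "@include")
                expand_file(parts[2])
            else
                print line
        }
        close(file)
    }
    {
        if (NF == 2 && $1 == "@include")
            expand_file($2)
        else
            print
    }' "$@"
}

# Assumed test data: file a includes file b.
dir=$(mktemp -d)
printf 'B1\n' > "$dir/b"
printf 'A1\n@include %s\nA2\n' "$dir/b" > "$dir/a"
printf 'top\n@include %s\nend\n' "$dir/a" | expand_includes
rm -rf "$dir"
```

Because the record is read into a variable, the included line is not split into fields automatically; the sketch calls @code{split} to examine it instead of relying on @code{NF} and @code{$1}.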
4404 4405@node Getline/Pipe, Getline/Variable/Pipe, Getline/Variable/File, Getline 4406@subsection Using @code{getline} from a Pipe 4407 4408@cindex input pipeline 4409@cindex pipeline, input 4410You can pipe the output of a command into @code{getline}, using 4411@samp{@var{command} | getline}. In 4412this case, the string @var{command} is run as a shell command and its output 4413is piped into @code{awk} to be used as input. This form of @code{getline} 4414reads one record at a time from the pipe. 4415 4416For example, the following program copies its input to its output, except for 4417lines that begin with @samp{@@execute}, which are replaced by the output 4418produced by running the rest of the line as a shell command: 4419 4420@example 4421@group 4422awk '@{ 4423 if ($1 == "@@execute") @{ 4424 tmp = substr($0, 10) 4425 while ((tmp | getline) > 0) 4426 print 4427 close(tmp) 4428 @} else 4429 print 4430@}' 4431@end group 4432@end example 4433 4434@noindent 4435The @code{close} function is called to ensure that if two identical 4436@samp{@@execute} lines appear in the input, the command is run for 4437each one. 4438@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}. 4439@c Exercise!! 4440@c This example is unrealistic, since you could just use system 4441 4442Given the input: 4443 4444@example 4445@group 4446foo 4447bar 4448baz 4449@@execute who 4450bletch 4451@end group 4452@end example 4453 4454@noindent 4455the program might produce: 4456 4457@example 4458@group 4459foo 4460bar 4461baz 4462arnold ttyv0 Jul 13 14:22 4463miriam ttyp0 Jul 13 14:23 (murphy:0) 4464bill ttyp1 Jul 13 14:23 (murphy:0) 4465bletch 4466@end group 4467@end example 4468 4469@noindent 4470Notice that this program ran the command @code{who} and printed the result. 4471(If you try this program yourself, you will of course get different results, 4472showing you who is logged in on your system.) 
4473 4474This variation of @code{getline} splits the record into fields, sets the 4475value of @code{NF} and recomputes the value of @code{$0}. The values of 4476@code{NR} and @code{FNR} are not changed. 4477 4478@c Thanks to Paul Eggert for initial wording here 4479According to POSIX, @samp{@var{expression} | getline} is ambiguous if 4480@var{expression} contains unparenthesized operators other than 4481@samp{$}; for example, @samp{"echo " "date" | getline} is ambiguous 4482because the concatenation operator is not parenthesized, and you should 4483write it as @samp{("echo " "date") | getline} if you want your program 4484to be portable to other @code{awk} implementations. 4485(It happens that @code{gawk} gets it right, but you should not 4486rely on this. Parentheses make it easier to read, anyway.) 4487 4488@node Getline/Variable/Pipe, Getline Summary, Getline/Pipe, Getline 4489@subsection Using @code{getline} Into a Variable from a Pipe 4490 4491When you use @samp{@var{command} | getline @var{var}}, the 4492output of the command @var{command} is sent through a pipe to 4493@code{getline} and into the variable @var{var}. For example, the 4494following program reads the current date and time into the variable 4495@code{current_time}, using the @code{date} utility, and then 4496prints it. 4497 4498@example 4499@group 4500awk 'BEGIN @{ 4501 "date" | getline current_time 4502 close("date") 4503 print "Report printed on " current_time 4504@}' 4505@end group 4506@end example 4507 4508In this version of @code{getline}, none of the built-in variables are 4509changed, and the record is not split into fields. 
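
When a command produces more than one line, this form is normally used in a loop. The following sketch is an illustration only; the function name @code{count_lines_demo} and the command string are assumptions. It counts the lines a command prints, remembers the last one, and shows that @code{NR} is left alone.

```shell
# Hypothetical demonstration: loop over command output with getline var.
count_lines_demo() {
    awk 'BEGIN {
        cmd = "echo one; echo two"      # any shell command would do
        while ((cmd | getline line) > 0) {
            n++
            last = line
        }
        close(cmd)
        print n, last, NR               # NR is still 0 inside BEGIN
    }'
}

count_lines_demo
```

This prints @samp{2 two 0}. Keeping the command in a variable makes it easy to pass exactly the same string to @code{close}.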

@ifinfo
@c Thanks to Paul Eggert for initial wording here
According to POSIX, @samp{@var{expression} | getline @var{var}} is ambiguous if
@var{expression} contains unparenthesized operators other than
@samp{$}; for example, @samp{"echo " "date" | getline @var{var}} is ambiguous
because the concatenation operator is not parenthesized, and you should
write it as @samp{("echo " "date") | getline @var{var}} if you want your
program to be portable to other @code{awk} implementations.
(It happens that @code{gawk} gets it right, but you should not
rely on this. Parentheses make it easier to read, anyway.)
@end ifinfo

@node Getline Summary, , Getline/Variable/Pipe, Getline
@subsection Summary of @code{getline} Variants

With all the forms of @code{getline}, even though @code{$0} and @code{NF}
may be updated, the record will not be tested against all the patterns
in the @code{awk} program, in the way that would happen if the record
were read normally by the main processing loop of @code{awk}. However,
the new record is tested against any subsequent rules.

@cindex differences between @code{gawk} and @code{awk}
@cindex limitations
@cindex implementation limits
Many @code{awk} implementations limit the number of pipelines an @code{awk}
program may have open to just one! In @code{gawk}, there is no such limit.
You can open as many pipelines as the underlying operating system will
permit.

@vindex FILENAME
@cindex dark corner
@cindex @code{getline}, setting @code{FILENAME}
@cindex @code{FILENAME}, being set by @code{getline}
An interesting side-effect occurs if you use @code{getline} (without a
redirection) inside a @code{BEGIN} rule. Since an unredirected @code{getline}
reads from the command line data files, the first @code{getline} command
causes @code{awk} to set the value of @code{FILENAME}.
Normally, 4548@code{FILENAME} does not have a value inside @code{BEGIN} rules, since you 4549have not yet started to process the command line data files (d.c.). 4550(@xref{BEGIN/END, , The @code{BEGIN} and @code{END} Special Patterns}, 4551also @pxref{Auto-set, , Built-in Variables that Convey Information}.) 4552 4553The following table summarizes the six variants of @code{getline}, 4554listing which built-in variables are set by each one. 4555 4556@c @cartouche 4557@table @code 4558@item getline 4559sets @code{$0}, @code{NF}, @code{FNR}, and @code{NR}. 4560 4561@item getline @var{var} 4562sets @var{var}, @code{FNR}, and @code{NR}. 4563 4564@item getline < @var{file} 4565sets @code{$0}, and @code{NF}. 4566 4567@item getline @var{var} < @var{file} 4568sets @var{var}. 4569 4570@item @var{command} | getline 4571sets @code{$0}, and @code{NF}. 4572 4573@item @var{command} | getline @var{var} 4574sets @var{var}. 4575@end table 4576@c @end cartouche 4577 4578@node Printing, Expressions, Reading Files, Top 4579@chapter Printing Output 4580 4581@cindex printing 4582@cindex output 4583One of the most common actions is to @dfn{print}, or output, 4584some or all of the input. You use the @code{print} statement 4585for simple output. You use the @code{printf} statement 4586for fancier formatting. Both are described in this chapter. 4587 4588@menu 4589* Print:: The @code{print} statement. 4590* Print Examples:: Simple examples of @code{print} statements. 4591* Output Separators:: The output separators and how to change them. 4592* OFMT:: Controlling Numeric Output With @code{print}. 4593* Printf:: The @code{printf} statement. 4594* Redirection:: How to redirect output to multiple files and 4595 pipes. 4596* Special Files:: File name interpretation in @code{gawk}. 4597 @code{gawk} allows access to inherited file 4598 descriptors. 4599* Close Files And Pipes:: Closing Input and Output Files and Pipes. 
4600@end menu 4601 4602@node Print, Print Examples, Printing, Printing 4603@section The @code{print} Statement 4604@cindex @code{print} statement 4605 4606The @code{print} statement does output with simple, standardized 4607formatting. You specify only the strings or numbers to be printed, in a 4608list separated by commas. They are output, separated by single spaces, 4609followed by a newline. The statement looks like this: 4610 4611@example 4612print @var{item1}, @var{item2}, @dots{} 4613@end example 4614 4615@noindent 4616The entire list of items may optionally be enclosed in parentheses. The 4617parentheses are necessary if any of the item expressions uses the @samp{>} 4618relational operator; otherwise it could be confused with a redirection 4619(@pxref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}). 4620 4621The items to be printed can be constant strings or numbers, fields of the 4622current record (such as @code{$1}), variables, or any @code{awk} 4623expressions. 4624Numeric values are converted to strings, and then printed. 4625 4626The @code{print} statement is completely general for 4627computing @emph{what} values to print. However, with two exceptions, 4628you cannot specify @emph{how} to print them---how many 4629columns, whether to use exponential notation or not, and so on. 4630(For the exceptions, @pxref{Output Separators}, and 4631@ref{OFMT, ,Controlling Numeric Output with @code{print}}.) 4632For that, you need the @code{printf} statement 4633(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}). 4634 4635The simple statement @samp{print} with no items is equivalent to 4636@samp{print $0}: it prints the entire current record. To print a blank 4637line, use @samp{print ""}, where @code{""} is the empty string. 4638 4639To print a fixed piece of text, use a string constant such as 4640@w{@code{"Don't Panic"}} as one item. 
If you forget to use the 4641double-quote characters, your text will be taken as an @code{awk} 4642expression, and you will probably get an error. Keep in mind that a 4643space is printed between any two items. 4644 4645Each @code{print} statement makes at least one line of output. But it 4646isn't limited to one line. If an item value is a string that contains a 4647newline, the newline is output along with the rest of the string. A 4648single @code{print} can make any number of lines this way. 4649 4650@node Print Examples, Output Separators, Print, Printing 4651@section Examples of @code{print} Statements 4652 4653Here is an example of printing a string that contains embedded newlines 4654(the @samp{\n} is an escape sequence, used to represent the newline 4655character; @pxref{Escape Sequences}): 4656 4657@example 4658@group 4659$ awk 'BEGIN @{ print "line one\nline two\nline three" @}' 4660@print{} line one 4661@print{} line two 4662@print{} line three 4663@end group 4664@end example 4665 4666Here is an example that prints the first two fields of each input record, 4667with a space between them: 4668 4669@example 4670@group 4671$ awk '@{ print $1, $2 @}' inventory-shipped 4672@print{} Jan 13 4673@print{} Feb 15 4674@print{} Mar 15 4675@dots{} 4676@end group 4677@end example 4678 4679@cindex common mistakes 4680@cindex mistakes, common 4681@cindex errors, common 4682A common mistake in using the @code{print} statement is to omit the comma 4683between two items. This often has the effect of making the items run 4684together in the output, with no space. The reason for this is that 4685juxtaposing two string expressions in @code{awk} means to concatenate 4686them. 
Here is the same program, without the comma: 4687 4688@example 4689@group 4690$ awk '@{ print $1 $2 @}' inventory-shipped 4691@print{} Jan13 4692@print{} Feb15 4693@print{} Mar15 4694@dots{} 4695@end group 4696@end example 4697 4698To someone unfamiliar with the file @file{inventory-shipped}, neither 4699example's output makes much sense. A heading line at the beginning 4700would make it clearer. Let's add some headings to our table of months 4701(@code{$1}) and green crates shipped (@code{$2}). We do this using the 4702@code{BEGIN} pattern 4703(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}) 4704to force the headings to be printed only once: 4705 4706@example 4707awk 'BEGIN @{ print "Month Crates" 4708 print "----- ------" @} 4709 @{ print $1, $2 @}' inventory-shipped 4710@end example 4711 4712@noindent 4713Did you already guess what happens? When run, the program prints 4714the following: 4715 4716@example 4717@group 4718Month Crates 4719----- ------ 4720Jan 13 4721Feb 15 4722Mar 15 4723@dots{} 4724@end group 4725@end example 4726 4727@noindent 4728The headings and the table data don't line up! We can fix this by printing 4729some spaces between the two fields: 4730 4731@example 4732awk 'BEGIN @{ print "Month Crates" 4733 print "----- ------" @} 4734 @{ print $1, " ", $2 @}' inventory-shipped 4735@end example 4736 4737You can imagine that this way of lining up columns can get pretty 4738complicated when you have many columns to fix. Counting spaces for two 4739or three columns can be simple, but more than this and you can get 4740lost quite easily. This is why the @code{printf} statement was 4741created (@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}); 4742one of its specialties is lining up columns of data. 
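
As a preview, the table above could be lined up with @code{printf} along the following lines. This is a sketch only: the function name @code{crates_table}, the column widths (5 and 6), and the three sample records are assumptions, and the format-control letters it uses are explained in the section on @code{printf}.

```shell
# Hypothetical sketch: line up the headings and the data with printf.
crates_table() {
    awk 'BEGIN { printf "%-5s %6s\n", "Month", "Crates"
                 printf "%-5s %6s\n", "-----", "------" }
               { printf "%-5s %6d\n", $1, $2 }' "$@"
}

# Assumed sample data in the style of inventory-shipped.
printf 'Jan 13\nFeb 15\nMar 15\n' | crates_table
```

The month is left-justified in a five-character column and the crate count is right-justified in a six-character column, so no spaces have to be counted by hand.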
4743 4744@cindex line continuation 4745As a side point, 4746you can continue either a @code{print} or @code{printf} statement simply 4747by putting a newline after any comma 4748(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}). 4749 4750@node Output Separators, OFMT, Print Examples, Printing 4751@section Output Separators 4752 4753@cindex output field separator, @code{OFS} 4754@cindex output record separator, @code{ORS} 4755@vindex OFS 4756@vindex ORS 4757As mentioned previously, a @code{print} statement contains a list 4758of items, separated by commas. In the output, the items are normally 4759separated by single spaces. This need not be the case; a 4760single space is only the default. You can specify any string of 4761characters to use as the @dfn{output field separator} by setting the 4762built-in variable @code{OFS}. The initial value of this variable 4763is the string @w{@code{" "}}, that is, a single space. 4764 4765The output from an entire @code{print} statement is called an 4766@dfn{output record}. Each @code{print} statement outputs one output 4767record and then outputs a string called the @dfn{output record separator}. 4768The built-in variable @code{ORS} specifies this string. The initial 4769value of @code{ORS} is the string @code{"\n"}, i.e.@: a newline 4770character; thus, normally each @code{print} statement makes a separate line. 4771 4772You can change how output fields and records are separated by assigning 4773new values to the variables @code{OFS} and/or @code{ORS}. The usual 4774place to do this is in the @code{BEGIN} rule 4775(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}), so 4776that it happens before any input is processed. You may also do this 4777with assignments on the command line, before the names of your input 4778files, or using the @samp{-v} command line option 4779(@pxref{Options, ,Command Line Options}). 
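
The three places for setting a separator can be compared side by side. The following sketch is an illustration only; the function name @code{ofs_three_ways} and the sample input are assumptions, and the third command relies on the POSIX convention that a file operand of @samp{-} stands for the standard input.

```shell
# Hypothetical demonstration: three ways of setting OFS to a colon.
ofs_three_ways() {
    # 1. In a BEGIN rule.
    echo 'a b' | awk 'BEGIN { OFS = ":" } { print $1, $2 }'
    # 2. With the -v option, processed before the program starts.
    echo 'a b' | awk -v OFS=: '{ print $1, $2 }'
    # 3. As a command-line assignment placed before the input file;
    #    here "-" names the standard input.
    echo 'a b' | awk '{ print $1, $2 }' OFS=: -
}

ofs_three_ways
```

Each of the three commands prints @samp{a:b}. A command-line assignment takes effect only when @code{awk} reaches it in the argument list, so it has to come before the file it is meant to affect.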

@ignore
Exercise,
Rewrite the
@example
awk 'BEGIN @{ print "Month Crates"
             print "----- ------" @}
           @{ print $1, " ", $2 @}' inventory-shipped
@end example
program by using a new value of @code{OFS}.
@end ignore

The following example prints the first and second fields of each input
record separated by a semicolon, with a blank line added after each
line:

@example
@group
$ awk 'BEGIN @{ OFS = ";"; ORS = "\n\n" @}
>            @{ print $1, $2 @}' BBS-list
@print{} aardvark;555-5553
@print{}
@print{} alpo-net;555-3412
@print{}
@print{} barfly;555-7685
@dots{}
@end group
@end example

If the value of @code{ORS} does not contain a newline, all your output
will be run together on a single line, unless you output newlines some
other way.

@node OFMT, Printf, Output Separators, Printing
@section Controlling Numeric Output with @code{print}
@vindex OFMT
@cindex numeric output format
@cindex format, numeric output
@cindex output format specifier, @code{OFMT}
When you use the @code{print} statement to print numeric values,
@code{awk} internally converts the number to a string of characters,
and prints that string. @code{awk} uses the @code{sprintf} function
to do this conversion
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
For now, it suffices to say that the @code{sprintf}
function accepts a @dfn{format specification} that tells it how to format
numbers (or strings), and that there are a number of different ways in which
numbers can be formatted. The different format specifications are discussed
more fully in
@ref{Control Letters, , Format-Control Letters}.

The built-in variable @code{OFMT} contains the default format specification
that @code{print} uses with @code{sprintf} when it wants to convert a
number to a string for printing.
The default value of @code{OFMT} is @code{"%.6g"}.
By supplying different format specifications
as the value of @code{OFMT}, you can change how @code{print} will print
your numbers. As a brief example:

@example
@group
$ awk 'BEGIN @{
>   OFMT = "%.0f"  # print numbers as integers (rounds)
>   print 17.23 @}'
@print{} 17
@end group
@end example

@noindent
@cindex dark corner
@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
According to the POSIX standard, @code{awk}'s behavior will be undefined
if @code{OFMT} contains anything but a floating point conversion specification
(d.c.).

@node Printf, Redirection, OFMT, Printing
@section Using @code{printf} Statements for Fancier Printing
@cindex formatted output
@cindex output, formatted

If you want more precise control over the output format than
@code{print} gives you, use @code{printf}. With @code{printf} you can
specify the width to use for each item, and you can specify various
formatting choices for numbers (such as what radix to use, whether to
print an exponent, whether to print a sign, and how many digits to print
after the decimal point). You do this by supplying a string, called
the @dfn{format string}, which controls how and where to print the other
arguments.

@menu
* Basic Printf::                Syntax of the @code{printf} statement.
* Control Letters::             Format-control letters.
* Format Modifiers::            Format-specification modifiers.
* Printf Examples::             Several examples.
@end menu

@node Basic Printf, Control Letters, Printf, Printf
@subsection Introduction to the @code{printf} Statement

@cindex @code{printf} statement, syntax of
The @code{printf} statement looks like this:

@example
printf @var{format}, @var{item1}, @var{item2}, @dots{}
@end example

@noindent
The entire list of arguments may optionally be enclosed in parentheses. The
parentheses are necessary if any of the item expressions use the @samp{>}
relational operator; otherwise it could be confused with a redirection
(@pxref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}).

@cindex format string
The difference between @code{printf} and @code{print} is the @var{format}
argument. This is an expression whose value is taken as a string; it
specifies how to output each of the other arguments. It is called
the @dfn{format string}.

The format string is very similar to that in the ANSI C library function
@code{printf}. Most of @var{format} is text to be output verbatim.
Scattered among this text are @dfn{format specifiers}, one per item.
Each format specifier says to output the next item in the argument list
at that place in the format.

The @code{printf} statement does not automatically append a newline to its
output. It outputs only what the format string specifies. So if you want
a newline, you must include one in the format string. The output separator
variables @code{OFS} and @code{ORS} have no effect on @code{printf}
statements. For example:

@example
@group
BEGIN @{
   ORS = "\nOUCH!\n"; OFS = "!"
   msg = "Don't Panic!"; printf "%s\n", msg
@}
@end group
@end example

This program still prints the familiar @samp{Don't Panic!} message.
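
To see the contrast side by side, here is a small sketch (the
separator values are arbitrary, chosen only for illustration) that
sends the same two items through @code{print} and then through
@code{printf}. Only the @code{print} output picks up @code{OFS} and
@code{ORS}:

@example
@group
$ awk 'BEGIN @{ OFS = "-"; ORS = "!\n"
>              print "a", "b"               # uses OFS and ORS
>              printf "%s %s\n", "a", "b"   # uses only the format
>      @}'
@print{} a-b!
@print{} a b
@end group
@end example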

@node Control Letters, Format Modifiers, Basic Printf, Printf
@subsection Format-Control Letters
@cindex @code{printf}, format-control characters
@cindex format specifier

A format specifier starts with the character @samp{%} and ends with a
@dfn{format-control letter}; it tells the @code{printf} statement how
to output one item. (If you actually want to output a @samp{%}, write
@samp{%%}.) The format-control letter specifies what kind of value to
print. The rest of the format specifier is made up of optional
@dfn{modifiers} which are parameters to use, such as the field width.

Here is a list of the format-control letters:

@table @code
@item c
This prints a number as an ASCII character. Thus, @samp{printf "%c",
65} outputs the letter @samp{A}. The output for a string value is
the first character of the string.

@item d
@itemx i
These are equivalent. They both print a decimal integer.
The @samp{%i} specification is for compatibility with ANSI C.

@item e
@itemx E
This prints a number in scientific (exponential) notation.
For example,

@example
printf "%4.3e\n", 1950
@end example

@noindent
prints @samp{1.950e+03}, with a total of four significant figures of
which three follow the decimal point. The @samp{4.3} are modifiers,
discussed below. @samp{%E} uses @samp{E} instead of @samp{e} in the output.

@item f
This prints a number in floating point notation.
For example,

@example
printf "%4.3f", 1950
@end example

@noindent
prints @samp{1950.000}, with three digits following the decimal point.
The @samp{4} requests a minimum field width of four characters, which
this value already exceeds. The @samp{4.3} are modifiers,
discussed below.

@item g
@itemx G
This prints a number in either scientific notation or floating point
notation, whichever uses fewer characters.
If the result is printed in
scientific notation, @samp{%G} uses @samp{E} instead of @samp{e}.

@item o
This prints an unsigned octal integer.
(In octal, or base-eight notation, the digits run from @samp{0} to @samp{7};
the decimal number eight is represented as @samp{10} in octal.)

@item s
This prints a string.

@item u
This prints an unsigned decimal number.
(This format is of marginal use, since all numbers in @code{awk}
are floating point. It is provided primarily for compatibility
with C.)

@item x
@itemx X
This prints an unsigned hexadecimal integer.
(In hexadecimal, or base-16 notation, the digits are @samp{0} through @samp{9}
and @samp{a} through @samp{f}. The hexadecimal digit @samp{f} represents
the decimal number 15.) @samp{%X} uses the letters @samp{A} through @samp{F}
instead of @samp{a} through @samp{f}.

@item %
This isn't really a format-control letter, but it does have a meaning
when used after a @samp{%}: the sequence @samp{%%} outputs one
@samp{%}. It does not consume an argument, and it ignores any
modifiers.
@end table

@cindex dark corner
When using the integer format-control letters for values that are outside
the range of a C @code{long} integer, @code{gawk} will switch to the
@samp{%g} format specifier. Other versions of @code{awk} may print
invalid values, or do something else entirely (d.c.).

@node Format Modifiers, Printf Examples, Control Letters, Printf
@subsection Modifiers for @code{printf} Formats

@cindex @code{printf}, modifiers
@cindex modifiers (in format specifiers)
A format specification can also include @dfn{modifiers} that can control
how much of the item's value is printed and how much space it gets. The
modifiers come between the @samp{%} and the format-control letter.
In the examples below, we use the bullet symbol ``@bullet{}'' to represent
spaces in the output. Here are the possible modifiers, in the order in
which they may appear:

@table @code
@item -
The minus sign, used before the width modifier (see below),
says to left-justify
the argument within its specified width. Normally the argument
is printed right-justified in the specified width. Thus,

@example
printf "%-4s", "foo"
@end example

@noindent
prints @samp{foo@bullet{}}.

@item @var{space}
For numeric conversions, prefix positive values with a space, and
negative values with a minus sign.

@item +
The plus sign, used before the width modifier (see below),
says to always supply a sign for numeric conversions, even if the data
to be formatted is positive. The @samp{+} overrides the space modifier.

@item #
Use an ``alternate form'' for certain control letters.
For @samp{%o}, supply a leading zero.
For @samp{%x}, and @samp{%X}, supply a leading @samp{0x} or @samp{0X} for
a non-zero result.
For @samp{%e}, @samp{%E}, and @samp{%f}, the result will always contain a
decimal point.
For @samp{%g}, and @samp{%G}, trailing zeros are not removed from the result.

@cindex dark corner
@item 0
A leading @samp{0} (zero) acts as a flag, that indicates output should be
padded with zeros instead of spaces.
This applies even to non-numeric output formats (d.c.).
This flag only has an effect when the field width is wider than the
value to be printed.

@item @var{width}
This is a number specifying the desired minimum width of a field. Inserting any
number between the @samp{%} sign and the format control character forces the
field to be expanded to this width. The default way to do this is to
pad with spaces on the left.
For example,

@example
printf "%4s", "foo"
@end example

@noindent
prints @samp{@bullet{}foo}.

The value of @var{width} is a minimum width, not a maximum. If the item
value requires more than @var{width} characters, it can be as wide as
necessary. Thus,

@example
printf "%4s", "foobar"
@end example

@noindent
prints @samp{foobar}.

Preceding the @var{width} with a minus sign causes the output to be
padded with spaces on the right, instead of on the left.

@item .@var{prec}
This is a number that specifies the precision to use when printing.
For the @samp{e}, @samp{E}, and @samp{f} formats, this specifies the
number of digits you want printed to the right of the decimal point.
For the @samp{g}, and @samp{G} formats, it specifies the maximum number
of significant digits. For the @samp{d}, @samp{o}, @samp{i}, @samp{u},
@samp{x}, and @samp{X} formats, it specifies the minimum number of
digits to print. For a string, it specifies the maximum number of
characters from the string that should be printed. Thus,

@example
printf "%.4s", "foobar"
@end example

@noindent
prints @samp{foob}.
@end table

The C library @code{printf}'s dynamic @var{width} and @var{prec}
capability (for example, @code{"%*.*s"}) is supported. Instead of
supplying explicit @var{width} and/or @var{prec} values in the format
string, you pass them in the argument list. For example:

@example
w = 5
p = 3
s = "abcdefg"
printf "%*.*s\n", w, p, s
@end example

@noindent
is exactly equivalent to

@example
s = "abcdefg"
printf "%5.3s\n", s
@end example

@noindent
Both programs output @samp{@w{@bullet{}@bullet{}abc}}.

Earlier versions of @code{awk} did not support this capability.
If you must use such a version, you may simulate this feature by using
concatenation to build up the format string, like so:

@example
w = 5
p = 3
s = "abcdefg"
printf "%" w "." p "s\n", s
@end example

@noindent
This is not particularly easy to read, but it does work.

@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
C programmers may be used to supplying additional @samp{l} and @samp{h}
flags in @code{printf} format strings. These are not valid in @code{awk}.
Most @code{awk} implementations silently ignore these flags.
If @samp{--lint} is provided on the command line
(@pxref{Options, ,Command Line Options}),
@code{gawk} will warn about their use. If @samp{--posix} is supplied,
their use is a fatal error.

@node Printf Examples, , Format Modifiers, Printf
@subsection Examples Using @code{printf}

Here is how to use @code{printf} to make an aligned table:

@example
awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
@end example

@noindent
prints the names of bulletin boards (@code{$1}) of the file
@file{BBS-list} as a string of 10 characters, left justified. It also
prints the phone numbers (@code{$2}) afterward on the line. This
produces an aligned two-column table of names and phone numbers:

@example
@group
$ awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list
@print{} aardvark   555-5553
@print{} alpo-net   555-3412
@print{} barfly     555-7685
@print{} bites      555-1675
@print{} camelot    555-0542
@print{} core       555-2912
@print{} fooey      555-1234
@print{} foot       555-6699
@print{} macfoo     555-6480
@print{} sdace      555-3430
@print{} sabafoo    555-2127
@end group
@end example

Did you notice that we did not specify that the phone numbers be printed
as numbers? They had to be printed as strings because the numbers are
separated by a dash.
If we had tried to print the phone numbers as numbers, all we would have
gotten would have been the first three digits, @samp{555}.
This would have been pretty confusing.

We did not specify a width for the phone numbers because they are the
last things on their lines. We don't need to put spaces after them.

We could make our table look even nicer by adding headings to the tops
of the columns. To do this, we use the @code{BEGIN} pattern
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns})
to force the header to be printed only once, at the beginning of
the @code{awk} program:

@example
@group
awk 'BEGIN @{ print "Name       Number"
             print "----       ------" @}
     @{ printf "%-10s %s\n", $1, $2 @}' BBS-list
@end group
@end example

Did you notice that we mixed @code{print} and @code{printf} statements in
the above example? We could have used just @code{printf} statements to get
the same results:

@example
@group
awk 'BEGIN @{ printf "%-10s %s\n", "Name", "Number"
             printf "%-10s %s\n", "----", "------" @}
     @{ printf "%-10s %s\n", $1, $2 @}' BBS-list
@end group
@end example

@noindent
By printing each column heading with the same format specification
used for the elements of the column, we have made sure that the headings
are aligned just like the columns.

The fact that the same format specification is used three times can be
emphasized by storing it in a variable, like this:

@example
@group
awk 'BEGIN @{ format = "%-10s %s\n"
             printf format, "Name", "Number"
             printf format, "----", "------" @}
     @{ printf format, $1, $2 @}' BBS-list
@end group
@end example

@c !!! exercise
See if you can use the @code{printf} statement to line up the headings and
table data for our @file{inventory-shipped} example covered earlier in the
section on the @code{print} statement
(@pxref{Print, ,The @code{print} Statement}).

@node Redirection, Special Files, Printf, Printing
@section Redirecting Output of @code{print} and @code{printf}

@cindex output redirection
@cindex redirection of output
So far we have been dealing only with output that prints to the standard
output, usually your terminal. Both @code{print} and @code{printf} can
also send their output to other places.
This is called @dfn{redirection}.

A redirection appears after the @code{print} or @code{printf} statement.
Redirections in @code{awk} are written just like redirections in shell
commands, except that they are written inside the @code{awk} program.

There are three forms of output redirection: output to a file,
output appended to a file, and output through a pipe to another
command.
They are all shown for
the @code{print} statement, but they work identically for @code{printf}
also.

@table @code
@item print @var{items} > @var{output-file}
This type of redirection prints the items into the output file
@var{output-file}. The file name @var{output-file} can be any
expression. Its value is changed to a string and then used as a
file name (@pxref{Expressions}).

When this type of redirection is used, the @var{output-file} is erased
before the first output is written to it. Subsequent writes
to the same @var{output-file} do not
erase @var{output-file}, but append to it. If @var{output-file} does
not exist, then it is created.

For example, here is how an @code{awk} program can write a list of
BBS names to a file @file{name-list} and a list of phone numbers to a
file @file{phone-list}.
Each output file contains one name or number
per line.

@example
@group
$ awk '@{ print $2 > "phone-list"
>        print $1 > "name-list" @}' BBS-list
@end group
@group
$ cat phone-list
@print{} 555-5553
@print{} 555-3412
@dots{}
@end group
@group
$ cat name-list
@print{} aardvark
@print{} alpo-net
@dots{}
@end group
@end example

@item print @var{items} >> @var{output-file}
This type of redirection prints the items into the pre-existing output file
@var{output-file}. The difference between this and the
single-@samp{>} redirection is that the old contents (if any) of
@var{output-file} are not erased. Instead, the @code{awk} output is
appended to the file.
If @var{output-file} does not exist, then it is created.

@cindex pipes for output
@cindex output, piping
@item print @var{items} | @var{command}
It is also possible to send output to another program through a pipe
instead of into a
file. This type of redirection opens a pipe to @var{command} and writes
the values of @var{items} through this pipe, to another process created
to execute @var{command}.

The redirection argument @var{command} is actually an @code{awk}
expression. Its value is converted to a string, whose contents give the
shell command to be run.

For example, this produces two files, one unsorted list of BBS names
and one list sorted in reverse alphabetical order:

@example
awk '@{ print $1 > "names.unsorted"
       command = "sort -r > names.sorted"
       print $1 | command @}' BBS-list
@end example

Here the unsorted list is written with an ordinary redirection while
the sorted list is written by piping through the @code{sort} utility.

This example uses redirection to mail a message to a mailing
list @samp{bug-system}.
This might be useful when trouble is encountered
in an @code{awk} script run periodically for system maintenance.

@example
report = "mail bug-system"
print "Awk script failed:", $0 | report
m = ("at record number " FNR " of " FILENAME)
print m | report
close(report)
@end example

The message is built using string concatenation and saved in the variable
@code{m}. It is then sent down the pipeline to the @code{mail} program.

We call the @code{close} function here because it's a good idea to close
the pipe as soon as all the intended output has been sent to it.
@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes},
for more information
on this. This example also illustrates the use of a variable to represent
a @var{file} or @var{command}: it is not necessary to always
use a string constant. Using a variable is generally a good idea,
since @code{awk} requires you to spell the string value identically
every time.
@end table

Redirecting output using @samp{>}, @samp{>>}, or @samp{|} asks the system
to open a file or pipe only if the particular @var{file} or @var{command}
you've specified has not already been written to by your program, or if
it has been closed since it was last written to.

@cindex differences between @code{gawk} and @code{awk}
@cindex limitations
@cindex implementation limits
@iftex
As mentioned earlier
(@pxref{Getline Summary, , Summary of @code{getline} Variants}),
many
@end iftex
@ifinfo
Many
@end ifinfo
@code{awk} implementations limit the number of pipelines an @code{awk}
program may have open to just one! In @code{gawk}, there is no such limit.
You can open as many pipelines as the underlying operating system will
permit.
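
Because the file name in a redirection may be any expression, one
program can spread its output over several files chosen from the data
itself. The following is only a sketch: the three-line input and the
@file{.list} suffix are made up for illustration. The concatenation is
enclosed in parentheses so that it cannot be misread as part of the
redirection:

@example
@group
$ printf 'x A\ny B\nz A\n' |
> awk '@{ print $1 > ($2 ".list") @}'   # file name built from field 2
$ cat A.list
@print{} x
@print{} z
$ cat B.list
@print{} y
@end group
@end example

@noindent
Each distinct file name is opened (and erased) once, on its first use;
later writes to the same name append to it, as described above.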

@node Special Files, Close Files And Pipes, Redirection, Printing
@section Special File Names in @code{gawk}
@cindex standard input
@cindex standard output
@cindex standard error output
@cindex file descriptors

Running programs conventionally have three input and output streams
already available to them for reading and writing. These are known as
the @dfn{standard input}, @dfn{standard output}, and @dfn{standard error
output}. These streams are, by default, connected to your terminal, but
they are often redirected with the shell, via the @samp{<}, @samp{<<},
@samp{>}, @samp{>>}, @samp{>&} and @samp{|} operators. Standard error
is typically used for writing error messages; the reason we have two separate
streams, standard output and standard error, is so that they can be
redirected separately.

@cindex differences between @code{gawk} and @code{awk}
In other implementations of @code{awk}, the only way to write an error
message to standard error in an @code{awk} program is as follows:

@example
print "Serious error detected!" | "cat 1>&2"
@end example

@noindent
This works by opening a pipeline to a shell command which can access the
standard error stream which it inherits from the @code{awk} process.
This is far from elegant, and is also inefficient, since it requires a
separate process. So people writing @code{awk} programs often
neglect to do this. Instead, they send the error messages to the
terminal, like this:

@example
@group
print "Serious error detected!" > "/dev/tty"
@end group
@end example

@noindent
This usually has the same effect, but not always: although the
standard error stream is usually the terminal, it can be redirected, and
when that happens, writing to the terminal is not correct. In fact, if
@code{awk} is run from a background job, it may not have a terminal at all.
Then opening @file{/dev/tty} will fail.

@code{gawk} provides special file names for accessing the three standard
streams. When you redirect input or output in @code{gawk}, if the file name
matches one of these special names, then @code{gawk} directly uses the
stream it stands for.

@cindex @file{/dev/stdin}
@cindex @file{/dev/stdout}
@cindex @file{/dev/stderr}
@cindex @file{/dev/fd}
@c @cartouche
@table @file
@item /dev/stdin
The standard input (file descriptor 0).

@item /dev/stdout
The standard output (file descriptor 1).

@item /dev/stderr
The standard error output (file descriptor 2).

@item /dev/fd/@var{N}
The file associated with file descriptor @var{N}. Such a file must have
been opened by the program initiating the @code{awk} execution (typically
the shell). Unless you take special pains in the shell from which
you invoke @code{gawk}, only descriptors 0, 1 and 2 are available.
@end table
@c @end cartouche

The file names @file{/dev/stdin}, @file{/dev/stdout}, and @file{/dev/stderr}
are aliases for @file{/dev/fd/0}, @file{/dev/fd/1}, and @file{/dev/fd/2},
respectively, but they are more self-explanatory.

The proper way to write an error message in a @code{gawk} program
is to use @file{/dev/stderr}, like this:

@example
print "Serious error detected!" > "/dev/stderr"
@end example

@code{gawk} also provides special file names that give access to information
about the running @code{gawk} process. Each of these ``files'' provides
a single record of information. To read them more than once, you must
first close them with the @code{close} function
(@pxref{Close Files And Pipes, ,Closing Input and Output Files and Pipes}).
The file names are:

@cindex process information
@cindex @file{/dev/pid}
@cindex @file{/dev/pgrpid}
@cindex @file{/dev/ppid}
@cindex @file{/dev/user}
@c @cartouche
@table @file
@item /dev/pid
Reading this file returns the process ID of the current process,
in decimal, terminated with a newline.

@item /dev/ppid
Reading this file returns the parent process ID of the current process,
in decimal, terminated with a newline.

@item /dev/pgrpid
Reading this file returns the process group ID of the current process,
in decimal, terminated with a newline.

@item /dev/user
Reading this file returns a single record terminated with a newline.
The fields are separated with spaces. The fields represent the
following information:

@table @code
@item $1
The return value of the @code{getuid} system call
(the real user ID number).

@item $2
The return value of the @code{geteuid} system call
(the effective user ID number).

@item $3
The return value of the @code{getgid} system call
(the real group ID number).

@item $4
The return value of the @code{getegid} system call
(the effective group ID number).
@end table

If there are any additional fields, they are the group IDs returned by
the @code{getgroups} system call.
(Multiple groups may not be supported on all systems.)
@end table
@c @end cartouche

These special file names may be used on the command line as data
files, as well as for I/O redirections within an @code{awk} program.
They may not be used as source files with the @samp{-f} option.

Recognition of these special file names is disabled if @code{gawk} is in
compatibility mode (@pxref{Options, ,Command Line Options}).
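
As a quick check that output sent to @file{/dev/stderr} really is
separate from the standard output, the following sketch (the messages
are placeholders) discards standard error in the shell. Only the line
written to standard output remains visible:

@example
@group
$ awk 'BEGIN @{ print "result"                    # standard output
>              print "warning" > "/dev/stderr"   # standard error
>      @}' 2> /dev/null
@print{} result
@end group
@end example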

@strong{Caution}: Unless your system actually has a @file{/dev/fd} directory
(or any of the other above listed special files),
the interpretation of these file names is done by @code{gawk} itself.
For example, using @samp{/dev/fd/4} for output will actually write on
file descriptor 4, and not on a new file descriptor that was @code{dup}'ed
from file descriptor 4. Most of the time this does not matter; however, it
is important to @emph{not} close any of the files related to file descriptors
0, 1, and 2. If you do close one of these files, unpredictable behavior
will result.

The special files that provide process-related information will disappear
in a future version of @code{gawk}.
@xref{Future Extensions, ,Probable Future Extensions}.

@node Close Files And Pipes, , Special Files, Printing
@section Closing Input and Output Files and Pipes
@cindex closing input files and pipes
@cindex closing output files and pipes
@findex close

If the same file name or the same shell command is used with
@code{getline}
(@pxref{Getline, ,Explicit Input with @code{getline}})
more than once during the execution of an @code{awk}
program, the file is opened (or the command is executed) only the first time.
At that time, the first record of input is read from that file or command.
The next time the same file or command is used in @code{getline}, another
record is read from it, and so on.

Similarly, when a file or pipe is opened for output, the file name or command
associated with
it is remembered by @code{awk} and subsequent writes to the same file or
command are appended to the previous writes. The file or pipe stays
open until @code{awk} exits.
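
The input side of this behavior can be seen in a short sketch (the
command string here is only an illustration). The command is started
once, and each @code{getline} from the same string picks up where the
previous one stopped:

@example
@group
$ awk 'BEGIN @{ cmd = "echo one; echo two"
>              cmd | getline a    # starts the command, reads "one"
>              cmd | getline b    # same command, reads "two"
>              print a, b @}'
@print{} one two
@end group
@end example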

This implies that if you want to start reading the same file again from
the beginning, or if you want to rerun a shell command (rather than
reading more output from the command), you must take special steps.
What you must do is use the @code{close} function, as follows:

@example
close(@var{filename})
@end example

@noindent
or

@example
close(@var{command})
@end example

The argument @var{filename} or @var{command} can be any expression. Its
value must @emph{exactly} match the string that was used to open the file or
start the command (spaces and other ``irrelevant'' characters
included). For example, if you open a pipe with this:

@example
"sort -r names" | getline foo
@end example

@noindent
then you must close it with this:

@example
close("sort -r names")
@end example

Once this function call is executed, the next @code{getline} from that
file or command, or the next @code{print} or @code{printf} to that
file or command, will reopen the file or rerun the command.

Because the expression that you use to close a file or pipeline must
exactly match the expression used to open the file or run the command,
it is good practice to use a variable to store the file name or command.
The previous example would become

@example
sortcom = "sort -r names"
sortcom | getline foo
@dots{}
close(sortcom)
@end example

@noindent
This helps avoid hard-to-find typographical errors in your @code{awk}
programs.

Here are some reasons why you might need to close an output file:

@itemize @bullet
@item
To write a file and read it back later on in the same @code{awk}
program. Close the file when you are finished writing it; then
you can start reading it with @code{getline}.

@item
To write numerous files, successively, in the same @code{awk}
program. If you don't close the files, eventually you may exceed a
system limit on the number of open files in one process. So close
each one when you are finished writing it.

@item
To make a command finish. When you redirect output through a pipe,
the command reading the pipe normally continues to try to read input
as long as the pipe is open. Often this means the command cannot
really do its work until the pipe is closed. For example, if you
redirect output to the @code{mail} program, the message is not
actually sent until the pipe is closed.

@c NEEDED
@page
@item
To run the same program a second time, with the same arguments.
This is not the same thing as giving more input to the first run!

For example, suppose you pipe output to the @code{mail} program. If you
output several lines redirected to this pipe without closing it, they make
a single message of several lines. By contrast, if you close the pipe
after each line of output, then each line makes a separate message.
@end itemize

@vindex ERRNO
@cindex differences between @code{gawk} and @code{awk}
@code{close} returns a value of zero if the close succeeded.
Otherwise, the value will be non-zero.
In this case, @code{gawk} sets the variable @code{ERRNO} to a string
describing the error that occurred.

@cindex differences between @code{gawk} and @code{awk}
@cindex portability issues
If you use more files than the system allows you to have open,
@code{gawk} will attempt to multiplex the available open files among
your data files. @code{gawk}'s ability to do this depends upon the
facilities of your operating system: it may not always work. It is
therefore both good practice and good portability advice to always
use @code{close} on your files when you are done with them.
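
The following sketch shows the effect of @code{close} on an input pipe.
The command string is purely illustrative and assumes a shell that
provides @code{echo}:

@example
BEGIN @{
    cmd = "echo hello"
    cmd | getline a     # first run of the command
    close(cmd)          # forget the pipe, so the command can be rerun
    cmd | getline b     # second run of the command
    print a, b
@}
@end example

@noindent
With the call to @code{close}, both variables should receive
@samp{hello}. Without it, the second @code{getline} would find the
pipe already at end of input and leave @code{b} empty.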

@node Expressions, Patterns and Actions, Printing, Top
@chapter Expressions
@cindex expression

Expressions are the basic building blocks of @code{awk} patterns
and actions. An expression evaluates to a value, which you can print, test,
store in a variable or pass to a function. Additionally, an expression
can assign a new value to a variable or a field, with an assignment operator.

An expression can serve as a pattern or action statement on its own.
Most other kinds of
statements contain one or more expressions which specify data on which to
operate. As in other languages, expressions in @code{awk} include
variables, array references, constants, and function calls, as well as
combinations of these with various operators.

@menu
* Constants::                   String, numeric, and regexp constants.
* Using Constant Regexps::      When and how to use a regexp constant.
* Variables::                   Variables give names to values for later use.
* Conversion::                  The conversion of strings to numbers and vice
                                versa.
* Arithmetic Ops::              Arithmetic operations (@samp{+}, @samp{-},
                                etc.)
* Concatenation::               Concatenating strings.
* Assignment Ops::              Changing the value of a variable or a field.
* Increment Ops::               Incrementing the numeric value of a variable.
* Truth Values::                What is ``true'' and what is ``false''.
* Typing and Comparison::       How variables acquire types, and how this
                                affects comparison of numbers and strings with
                                @samp{<}, etc.
* Boolean Ops::                 Combining comparison expressions using boolean
                                operators @samp{||} (``or''), @samp{&&}
                                (``and'') and @samp{!} (``not'').
* Conditional Exp::             Conditional expressions select between two
                                subexpressions under control of a third
                                subexpression.
* Function Calls::              A function call is an expression.
* Precedence::                  How various operators nest.
@end menu

@node Constants, Using Constant Regexps, Expressions, Expressions
@section Constant Expressions
@cindex constants, types of
@cindex string constants

The simplest type of expression is the @dfn{constant}, which always has
the same value. There are three types of constants: numeric constants,
string constants, and regular expression constants.

@menu
* Scalar Constants::            Numeric and string constants.
* Regexp Constants::            Regular Expression constants.
@end menu

@node Scalar Constants, Regexp Constants, Constants, Constants
@subsection Numeric and String Constants

@cindex numeric constant
@cindex numeric value
A @dfn{numeric constant} stands for a number. This number can be an
integer, a decimal fraction, or a number in scientific (exponential)
notation.@footnote{The internal representation uses double-precision
floating point numbers. If you don't know what that means, then don't
worry about it.} Here are some examples of numeric constants, which all
have the same value:

@example
105
1.05e+2
1050e-1
@end example

A string constant consists of a sequence of characters enclosed in
double-quote marks. For example:

@example
"parrot"
@end example

@noindent
@cindex differences between @code{gawk} and @code{awk}
represents the string whose contents are @samp{parrot}. Strings in
@code{gawk} can be of any length and they can contain any of the possible
eight-bit ASCII characters including ASCII NUL (character code zero).
Other @code{awk}
implementations may have difficulty with some character codes.
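
As a quick check, the three numeric constants shown earlier in this
section can be printed and compared. This is only an illustrative
sketch, not part of any larger program:

@example
BEGIN @{
    # three spellings of the same number
    print 105, 1.05e+2, 1050e-1
    # the comparison yields one if all three are equal
    print (105 == 1.05e+2 && 1.05e+2 == 1050e-1)
@}
@end example

@noindent
The first line of output should be @samp{105 105 105} and the second
should be @samp{1}.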

@node Regexp Constants, , Scalar Constants, Constants
@subsection Regular Expression Constants

@cindex @code{~} operator
@cindex @code{!~} operator
A regexp constant is a regular expression description enclosed in
slashes, such as @code{@w{/^beginning and end$/}}. Most regexps used in
@code{awk} programs are constant, but the @samp{~} and @samp{!~}
matching operators can also match computed or ``dynamic'' regexps
(which are just ordinary strings or variables that contain a regexp).

@node Using Constant Regexps, Variables, Constants, Expressions
@section Using Regular Expression Constants

When used on the right hand side of the @samp{~} or @samp{!~}
operators, a regexp constant merely stands for the regexp that is to be
matched.

@cindex dark corner
Regexp constants (such as @code{/foo/}) may be used like simple expressions.
When a
regexp constant appears by itself, it has the same meaning as if it appeared
in a pattern, i.e.@: @samp{($0 ~ /foo/)} (d.c.)
(@pxref{Expression Patterns, ,Expressions as Patterns}).
This means that the two code segments,

@example
if ($0 ~ /barfly/ || $0 ~ /camelot/)
    print "found"
@end example

@noindent
and

@example
if (/barfly/ || /camelot/)
    print "found"
@end example

@noindent
are exactly equivalent.

One rather bizarre consequence of this rule is that the following
boolean expression is valid, but does not do what the user probably
intended:

@example
# note that /foo/ is on the left of the ~
if (/foo/ ~ $1) print "found foo"
@end example

@noindent
This code is ``obviously'' testing @code{$1} for a match against the regexp
@code{/foo/}. But in fact, the expression @samp{/foo/ ~ $1} actually means
@samp{($0 ~ /foo/) ~ $1}. In other words, first match the input record
against the regexp @code{/foo/}.
The result will be either zero or one,
depending upon the success or failure of the match. Then match that result
against the first field in the record.

Since it is unlikely that you would ever really wish to make this kind of
test, @code{gawk} will issue a warning when it sees this construct in
a program.

Another consequence of this rule is that the assignment statement

@example
matches = /foo/
@end example

@noindent
will assign either zero or one to the variable @code{matches}, depending
upon the contents of the current input record.

This feature of the language was never well documented until the
POSIX specification.

@cindex differences between @code{gawk} and @code{awk}
@cindex dark corner
Constant regular expressions are also used as the first argument for
the @code{gensub}, @code{sub} and @code{gsub} functions, and as the
second argument of the @code{match} function
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
Modern implementations of @code{awk}, including @code{gawk}, allow
the third argument of @code{split} to be a regexp constant, while some
older implementations do not (d.c.).

This can lead to confusion when attempting to use regexp constants
as arguments to user defined functions
(@pxref{User-defined, , User-defined Functions}).
For example:

@example
@group
function mysub(pat, repl, str, global)
@{
    if (global)
        gsub(pat, repl, str)
    else
        sub(pat, repl, str)
    return str
@}
@end group

@group
@{
    @dots{}
    text = "hi! hi yourself!"
    mysub(/hi/, "howdy", text, 1)
    @dots{}
@}
@end group
@end example

In this example, the programmer wishes to pass a regexp constant to the
user-defined function @code{mysub}, which will in turn pass it on to
either @code{sub} or @code{gsub}.
However, what really happens is that
the @code{pat} parameter will be either one or zero, depending upon whether
or not @code{$0} matches @code{/hi/}.

As it is unlikely that you would ever really wish to pass a truth value
in this way, @code{gawk} will issue a warning when it sees a regexp
constant used as a parameter to a user-defined function.

@node Variables, Conversion, Using Constant Regexps, Expressions
@section Variables

Variables are ways of storing values at one point in your program for
use later in another part of your program. You can manipulate them
entirely within your program text, and you can also assign values to
them on the @code{awk} command line.

@menu
* Using Variables::             Using variables in your programs.
* Assignment Options::          Setting variables on the command line and a
                                summary of command line syntax. This is an
                                advanced method of input.
@end menu

@node Using Variables, Assignment Options, Variables, Variables
@subsection Using Variables in a Program

@cindex variables, user-defined
@cindex user-defined variables
Variables let you give names to values and refer to them later. You have
already seen variables in many of the examples. The name of a variable
must be a sequence of letters, digits and underscores, but it may not begin
with a digit. Case is significant in variable names; @code{a} and @code{A}
are distinct variables.

A variable name is a valid expression by itself; it represents the
variable's current value. Variables are given new values with
@dfn{assignment operators}, @dfn{increment operators} and
@dfn{decrement operators}.
@xref{Assignment Ops, ,Assignment Expressions}.

A few variables have special built-in meanings, such as @code{FS}, the
field separator, and @code{NF}, the number of fields in the current
input record.
@xref{Built-in Variables}, for a list of them. These
built-in variables can be used and assigned just like all other
variables, but their values are also used or changed automatically by
@code{awk}. All built-in variable names are entirely upper-case.

Variables in @code{awk} can be assigned either numeric or string
values. By default, variables are initialized to the empty string, which
is zero if converted to a number. There is no need to
``initialize'' each variable explicitly in @code{awk},
the way you would in C and in most other traditional languages.

@node Assignment Options, , Using Variables, Variables
@subsection Assigning Variables on the Command Line

You can set any @code{awk} variable by including a @dfn{variable assignment}
among the arguments on the command line when you invoke @code{awk}
(@pxref{Other Arguments, ,Other Command Line Arguments}). Such an assignment has
this form:

@example
@var{variable}=@var{text}
@end example

@noindent
With it, you can set a variable either at the beginning of the
@code{awk} run or in between input files.

If you precede the assignment with the @samp{-v} option, like this:

@example
-v @var{variable}=@var{text}
@end example

@noindent
then the variable is set at the very beginning, before even the
@code{BEGIN} rules are run. The @samp{-v} option and its assignment
must precede all the file name arguments, as well as the program text.
(@xref{Options, ,Command Line Options}, for more information about
the @samp{-v} option.)

Otherwise, the variable assignment is performed at a time determined by
its position among the input file arguments: after the processing of the
preceding input file argument.
For example:

@example
awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list
@end example

@noindent
prints the value of field number @code{n} for all input records. Before
the first file is read, the command line sets the variable @code{n}
equal to four. This causes the fourth field to be printed in lines from
the file @file{inventory-shipped}. After the first file has finished,
but before the second file is started, @code{n} is set to two, so that the
second field is printed in lines from @file{BBS-list}.

@example
@group
$ awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list
@print{} 15
@print{} 24
@dots{}
@print{} 555-5553
@print{} 555-3412
@dots{}
@end group
@end example

Command line arguments are made available for explicit examination by
the @code{awk} program in an array named @code{ARGV}
(@pxref{ARGC and ARGV, ,Using @code{ARGC} and @code{ARGV}}).

@cindex dark corner
@code{awk} processes the values of command line assignments for escape
sequences (d.c.) (@pxref{Escape Sequences}).

@node Conversion, Arithmetic Ops, Variables, Expressions
@section Conversion of Strings and Numbers

@cindex conversion of strings and numbers
Strings are converted to numbers, and numbers to strings, if the context
of the @code{awk} program demands it. For example, if the value of
either @code{foo} or @code{bar} in the expression @samp{foo + bar}
happens to be a string, it is converted to a number before the addition
is performed. If numeric values appear in string concatenation, they
are converted to strings. Consider this:

@example
two = 2; three = 3
print (two three) + 4
@end example

@noindent
This prints the (numeric) value 27.
The numeric values of
the variables @code{two} and @code{three} are converted to strings and
concatenated together, and the resulting string is converted back to the
number 23, to which four is then added.

@cindex null string
@cindex empty string
@cindex type conversion
If, for some reason, you need to force a number to be converted to a
string, concatenate the empty string, @code{""}, with that number.
To force a string to be converted to a number, add zero to that string.

A string is converted to a number by interpreting any numeric prefix
of the string as numerals:
@code{"2.5"} converts to 2.5, @code{"1e3"} converts to 1000, and @code{"25fix"}
has a numeric value of 25.
Strings that can't be interpreted as valid numbers are converted to
zero.

@vindex CONVFMT
The exact manner in which numbers are converted into strings is controlled
by the @code{awk} built-in variable @code{CONVFMT} (@pxref{Built-in Variables}).
Numbers are converted using the @code{sprintf} function
(@pxref{String Functions, ,Built-in Functions for String Manipulation})
with @code{CONVFMT} as the format
specifier.

@code{CONVFMT}'s default value is @code{"%.6g"}, which produces a value with
at most six significant digits. For some applications you will want to
change it to specify more precision. On most modern machines, you must
print 17 digits to capture a floating point number's value exactly.

Strange results can happen if you set @code{CONVFMT} to a string that doesn't
tell @code{sprintf} how to format floating point numbers in a useful way.
For example, if you forget the @samp{%} in the format, all numbers will be
converted to the same constant string.
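
The conversions described above can be tried with a short program.
This is only a sketch; the variable names and the sample value are
arbitrary:

@example
BEGIN @{
    # string to number: add zero
    print "25fix" + 0, "1e3" + 0

    # number to string: concatenate the empty string
    x = 3.14159265358979
    a = x ""            # converted with the default CONVFMT, "%.6g"
    CONVFMT = "%.2f"
    b = x ""            # converted with the new CONVFMT
    print a, b
@}
@end example

@noindent
The first line of output should be @samp{25 1000}, and the second
should be @samp{3.14159 3.14}.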

@cindex dark corner
As a special case, if a number is an integer, then the result of converting
it to a string is @emph{always} an integer, no matter what the value of
@code{CONVFMT} may be. Given the following code fragment:

@example
CONVFMT = "%2.2f"
a = 12
b = a ""
@end example

@noindent
@code{b} has the value @code{"12"}, not @code{"12.00"} (d.c.).

@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
@vindex OFMT
Prior to the POSIX standard, @code{awk} specified that the value
of @code{OFMT} was used for converting numbers to strings. @code{OFMT}
specifies the output format to use when printing numbers with @code{print}.
@code{CONVFMT} was introduced in order to separate the semantics of
conversion from the semantics of printing. Both @code{CONVFMT} and
@code{OFMT} have the same default value: @code{"%.6g"}. In the vast majority
of cases, old @code{awk} programs will not change their behavior.
However, this use of @code{OFMT} is something to keep in mind if you must
port your program to other implementations of @code{awk}; we recommend
that instead of changing your programs, you just port @code{gawk} itself!
@xref{Print, ,The @code{print} Statement},
for more information on the @code{print} statement.

@node Arithmetic Ops, Concatenation, Conversion, Expressions
@section Arithmetic Operators
@cindex arithmetic operators
@cindex operators, arithmetic
@cindex addition
@cindex subtraction
@cindex multiplication
@cindex division
@cindex remainder
@cindex quotient
@cindex exponentiation

The @code{awk} language uses the common arithmetic operators when
evaluating expressions. All of these arithmetic operators follow normal
precedence rules, and work as you would expect them to.
Arithmetic
operations are evaluated using double precision floating point, which
has the usual problems of inexactness and exceptions.@footnote{David
Goldberg, @uref{http://www.validgh.com/goldberg/paper.ps, @cite{What Every
Computer Scientist Should Know About Floating-point Arithmetic}},
@cite{ACM Computing Surveys} @strong{23}, 1 (1991-03), 5-48.}

Here is a file @file{grades} containing a list of student names and
three test scores per student (it's a small class):

@example
Pat   100 97 58
Sandy  84 72 93
Chris  72 92 89
@end example

@noindent
This program takes the file @file{grades}, and prints the average
of the scores.

@example
$ awk '@{ sum = $2 + $3 + $4 ; avg = sum / 3
>        print $1, avg @}' grades
@print{} Pat 85
@print{} Sandy 83
@print{} Chris 84.3333
@end example

This table lists the arithmetic operators in @code{awk}, in order from
highest precedence to lowest:

@c @cartouche
@table @code
@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
@item @var{x} ^ @var{y}
@itemx @var{x} ** @var{y}
Exponentiation: @var{x} raised to the @var{y} power. @samp{2 ^ 3} has
the value eight. The character sequence @samp{**} is equivalent to
@samp{^}. (The POSIX standard only specifies the use of @samp{^}
for exponentiation.)

@item - @var{x}
Negation.

@item + @var{x}
Unary plus. The expression is converted to a number.

@item @var{x} * @var{y}
Multiplication.

@item @var{x} / @var{y}
Division. Since all numbers in @code{awk} are
floating point numbers, the result is not rounded to an integer: @samp{3 / 4}
has the value 0.75.

@item @var{x} % @var{y}
@cindex differences between @code{gawk} and @code{awk}
Remainder.
The quotient is rounded toward zero to an integer,
multiplied by @var{y} and this result is subtracted from @var{x}.
This operation is sometimes known as ``trunc-mod.'' The following
relation always holds:

@example
b * int(a / b) + (a % b) == a
@end example

One possibly undesirable effect of this definition of remainder is that
@code{@var{x} % @var{y}} is negative if @var{x} is negative. Thus,

@example
-17 % 8 = -1
@end example

In other @code{awk} implementations, the signedness of the remainder
may be machine dependent.
@c !!! what does posix say?

@item @var{x} + @var{y}
Addition.

@item @var{x} - @var{y}
Subtraction.
@end table
@c @end cartouche

For maximum portability, do not use the @samp{**} operator.

Unary plus and minus have the same precedence,
the multiplication operators all have the same precedence, and
addition and subtraction have the same precedence.

@node Concatenation, Assignment Ops, Arithmetic Ops, Expressions
@section String Concatenation
@cindex Kernighan, Brian
@display
@i{It seemed like a good idea at the time.}
Brian Kernighan
@end display
@sp 1

@cindex string operators
@cindex operators, string
@cindex concatenation
There is only one string operation: concatenation. It does not have a
specific operator to represent it. Instead, concatenation is performed by
writing expressions next to one another, with no operator. For example:

@example
@group
$ awk '@{ print "Field number one: " $1 @}' BBS-list
@print{} Field number one: aardvark
@print{} Field number one: alpo-net
@dots{}
@end group
@end example

Without the space in the string constant after the @samp{:}, the line
would run together.
For example:

@example
@group
$ awk '@{ print "Field number one:" $1 @}' BBS-list
@print{} Field number one:aardvark
@print{} Field number one:alpo-net
@dots{}
@end group
@end example

Since string concatenation does not have an explicit operator, it is
often necessary to ensure that it happens where you want it to by
using parentheses to enclose
the items to be concatenated. For example, the
following code fragment does not concatenate @code{file} and @code{name}
as you might expect:

@example
@group
file = "file"
name = "name"
print "something meaningful" > file name
@end group
@end example

@noindent
It is necessary to use the following:

@example
print "something meaningful" > (file name)
@end example

We recommend that you use parentheses around concatenation in all but the
most common contexts (such as on the right-hand side of @samp{=}).

@node Assignment Ops, Increment Ops, Concatenation, Expressions
@section Assignment Expressions
@cindex assignment operators
@cindex operators, assignment
@cindex expression, assignment

An @dfn{assignment} is an expression that stores a new value into a
variable. For example, let's assign the value one to the variable
@code{z}:

@example
z = 1
@end example

After this expression is executed, the variable @code{z} has the value one.
Whatever old value @code{z} had before the assignment is forgotten.

Assignments can store string values also. For example, this would store
the value @code{"this food is good"} in the variable @code{message}:

@example
thing = "food"
predicate = "good"
message = "this " thing " is " predicate
@end example

@noindent
(This also illustrates string concatenation.)

The @samp{=} sign is called an @dfn{assignment operator}.
It is the
simplest assignment operator because the value of the right-hand
operand is stored unchanged.

@cindex side effect
Most operators (addition, concatenation, and so on) have no effect
except to compute a value. If you ignore the value, you might as well
not use the operator. An assignment operator is different; it does
produce a value, but even if you ignore the value, the assignment still
makes itself felt through the alteration of the variable. We call this
a @dfn{side effect}.

@cindex lvalue
@cindex rvalue
The left-hand operand of an assignment need not be a variable
(@pxref{Variables}); it can also be a field
(@pxref{Changing Fields, ,Changing the Contents of a Field}) or
an array element (@pxref{Arrays, ,Arrays in @code{awk}}).
These are all called @dfn{lvalues},
which means they can appear on the left-hand side of an assignment operator.
The right-hand operand may be any expression; it produces the new value
which the assignment stores in the specified variable, field or array
element. (Such values are called @dfn{rvalues}).

@cindex types of variables
It is important to note that variables do @emph{not} have permanent types.
The type of a variable is simply the type of whatever value it happens
to hold at the moment. In the following program fragment, the variable
@code{foo} has a numeric value at first, and a string value later on:

@example
@group
foo = 1
print foo
foo = "bar"
print foo
@end group
@end example

@noindent
When the second assignment gives @code{foo} a string value, the fact that
it previously had a numeric value is forgotten.

String values that do not begin with a digit have a numeric value of
zero.
After executing this code, the value of @code{foo} is five:

@example
foo = "a string"
foo = foo + 5
@end example

@noindent
(Note that using a variable as a number and then later as a string can
be confusing and is poor programming style. The above examples illustrate how
@code{awk} works, @emph{not} how you should write your own programs!)

An assignment is an expression, so it has a value: the same value that
is assigned. Thus, @samp{z = 1} as an expression has the value one.
One consequence of this is that you can write multiple assignments together:

@example
x = y = z = 0
@end example

@noindent
stores the value zero in all three variables. It does this because the
value of @samp{z = 0}, which is zero, is stored into @code{y}, and then
the value of @samp{y = z = 0}, which is zero, is stored into @code{x}.

You can use an assignment anywhere an expression is called for. For
example, it is valid to write @samp{x != (y = 1)} to set @code{y} to one
and then test whether @code{x} equals one. But this style tends to make
programs hard to read; except in a one-shot program, you should
not use such nesting of assignments.

Aside from @samp{=}, there are several other assignment operators that
do arithmetic with the old value of the variable. For example, the
operator @samp{+=} computes a new value by adding the right-hand value
to the old value of the variable. Thus, the following assignment adds
five to the value of @code{foo}:

@example
foo += 5
@end example

@noindent
This is equivalent to the following:

@example
foo = foo + 5
@end example

@noindent
Use whichever one makes the meaning of your program clearer.
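
The following sketch pulls these points together. The variable names
and starting values are arbitrary and serve only as an illustration:

@example
BEGIN @{
    x = y = z = 0       # chained assignment: all three become zero

    foo = 4
    foo += 5            # shorthand form

    bar = 4
    bar = bar + 5       # longhand form, same result

    print x, y, z, foo, bar
@}
@end example

@noindent
This should print @samp{0 0 0 9 9}, showing that the two ways of adding
five give the same result for a simple variable.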

There are situations where using @samp{+=} (or any assignment operator)
is @emph{not} the same as simply repeating the left-hand operand in the
right-hand expression. For example:

@cindex Rankin, Pat
@example
@group
# Thanks to Pat Rankin for this example
BEGIN  @{
    foo[rand()] += 5
    for (x in foo)
       print x, foo[x]

    bar[rand()] = bar[rand()] + 5
    for (x in bar)
       print x, bar[x]
@}
@end group
@end example

@noindent
The indices of @code{bar} are practically guaranteed to be different, because
@code{rand} will return different values each time it is called.
(Arrays and the @code{rand} function haven't been covered yet.
@xref{Arrays, ,Arrays in @code{awk}},
and see @ref{Numeric Functions, ,Numeric Built-in Functions}, for more information).
This example illustrates an important fact about the assignment
operators: the left-hand expression is only evaluated @emph{once}.

It is also up to the implementation as to which expression is evaluated
first, the left-hand one or the right-hand one.
Consider this example:

@example
i = 1
a[i += 2] = i + 1
@end example

@noindent
The value of @code{a[3]} could be either two or four.

Here is a table of the arithmetic assignment operators. In each
case, the right-hand operand is an expression whose value is converted
to a number.

@c @cartouche
@table @code
@item @var{lvalue} += @var{increment}
Adds @var{increment} to the value of @var{lvalue} to make the new value
of @var{lvalue}.

@item @var{lvalue} -= @var{decrement}
Subtracts @var{decrement} from the value of @var{lvalue}.

@item @var{lvalue} *= @var{coefficient}
Multiplies the value of @var{lvalue} by @var{coefficient}.

@item @var{lvalue} /= @var{divisor}
Divides the value of @var{lvalue} by @var{divisor}.

@item @var{lvalue} %= @var{modulus}
Sets @var{lvalue} to its remainder by @var{modulus}.

@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
@item @var{lvalue} ^= @var{power}
@itemx @var{lvalue} **= @var{power}
Raises @var{lvalue} to the power @var{power}.
(Only the @samp{^=} operator is specified by POSIX.)
@end table
@c @end cartouche

For maximum portability, do not use the @samp{**=} operator.

@node Increment Ops, Truth Values, Assignment Ops, Expressions
@section Increment and Decrement Operators

@cindex increment operators
@cindex operators, increment
@dfn{Increment} and @dfn{decrement operators} increase or decrease the value of
a variable by one. You could do the same thing with an assignment operator, so
the increment operators add no power to the @code{awk} language; but they
are convenient abbreviations for very common operations.

The operator to add one is written @samp{++}. It can be used to increment
a variable either before or after taking its value.

To pre-increment a variable @var{v}, write @samp{++@var{v}}. This adds
one to the value of @var{v} and that new value is also the value of this
expression. The assignment expression @samp{@var{v} += 1} is completely
equivalent.

Writing the @samp{++} after the variable specifies post-increment. This
increments the variable value just the same; the difference is that the
value of the increment expression itself is the variable's @emph{old}
value. Thus, if @code{foo} has the value four, then the expression @samp{foo++}
has the value four, but it changes the value of @code{foo} to five.

The post-increment @samp{foo++} is nearly equivalent to writing @samp{(foo
+= 1) - 1}.
It is not perfectly equivalent because all numbers in
@code{awk} are floating point: in floating point, @samp{foo + 1 - 1} does
not necessarily equal @code{foo}. But the difference is minute as
long as you stick to numbers that are fairly small (less than 10e12).

Any lvalue can be incremented. Fields and array elements are incremented
just like variables. (Use @samp{$(i++)} when you wish to do a field reference
and a variable increment at the same time. The parentheses are necessary
because of the precedence of the field reference operator, @samp{$}.)

@cindex decrement operators
@cindex operators, decrement
The decrement operator @samp{--} works just like @samp{++} except that
it subtracts one instead of adding. Like @samp{++}, it can be used before
the lvalue to pre-decrement or after it to post-decrement.

Here is a summary of increment and decrement expressions.

@c @cartouche
@table @code
@item ++@var{lvalue}
This expression increments @var{lvalue} and the new value becomes the
value of the expression.

@item @var{lvalue}++
This expression increments @var{lvalue}, but
the value of the expression is the @emph{old} value of @var{lvalue}.

@item --@var{lvalue}
Like @samp{++@var{lvalue}}, but instead of adding, it subtracts. It
decrements @var{lvalue} and delivers the value that results.

@item @var{lvalue}--
Like @samp{@var{lvalue}++}, but instead of adding, it subtracts. It
decrements @var{lvalue}. The value of the expression is the @emph{old}
value of @var{lvalue}.
@end table
@c @end cartouche

@node Truth Values, Typing and Comparison, Increment Ops, Expressions
@section True and False in @code{awk}
@cindex truth values
@cindex logical true
@cindex logical false

Many programming languages have a special representation for the concepts
of ``true'' and ``false.'' Such languages usually use the special
constants @code{true} and @code{false}, or perhaps their upper-case
equivalents.

@cindex null string
@cindex empty string
@code{awk} is different. It borrows a very simple concept of true and
false from C. In @code{awk}, any non-zero numeric value, @emph{or} any
non-empty string value is true. Any other value (zero or the null
string, @code{""}) is false. The following program will print @samp{A strange
truth value} three times:

@example
@group
BEGIN @{
    if (3.1415927)
        print "A strange truth value"
    if ("Four Score And Seven Years Ago")
        print "A strange truth value"
    if (j = 57)
        print "A strange truth value"
@}
@end group
@end example

@cindex dark corner
There is a surprising consequence of the ``non-zero or non-null'' rule:
The string constant @code{"0"} is actually true, since it is non-null (d.c.).

@node Typing and Comparison, Boolean Ops, Truth Values, Expressions
@section Variable Typing and Comparison Expressions
@cindex comparison expressions
@cindex expression, comparison
@cindex expression, matching
@cindex relational operators
@cindex operators, relational
@cindex regexp match/non-match operators
@cindex variable typing
@cindex types of variables
@c 2e: consider splitting this section into subsections
@display
@i{The Guide is definitive. Reality is frequently inaccurate.}
The Hitchhiker's Guide to the Galaxy
@end display
@sp 1

Unlike variables in many other programming languages, @code{awk} variables
do not have a fixed type. Instead, they can be either a number or a string,
depending upon the value that is assigned to them.

@cindex numeric string
The 1992 POSIX standard introduced
the concept of a @dfn{numeric string}, which is simply a string that looks
like a number, for example, @code{@w{" +2"}}. This concept is used
for determining the type of a variable.

The type of the variable is important, since the types of two variables
determine how they are compared.

In @code{gawk}, variable typing follows these rules.

@enumerate 1
@item
A numeric literal or the result of a numeric operation has the @var{numeric}
attribute.

@item
A string literal or the result of a string operation has the @var{string}
attribute.

@item
Fields, @code{getline} input, @code{FILENAME}, @code{ARGV} elements,
@code{ENVIRON} elements and the
elements of an array created by @code{split} that are numeric strings
have the @var{strnum} attribute. Otherwise, they have the @var{string}
attribute.
Uninitialized variables also have the @var{strnum} attribute.

@item
Attributes propagate across assignments, but are not changed by
any use.
@c (Although a use may cause the entity to acquire an additional
@c value such that it has both a numeric and string value -- this leaves the
@c attribute unchanged.)
@c This is important but not relevant
@end enumerate

The last rule is particularly important. In the following program,
@code{a} has numeric type, even though it is later used in a string
operation.

@example
BEGIN @{
    a = 12.345
    b = a " is a cute number"
    print b
@}
@end example

When two operands are compared, either string comparison or numeric comparison
may be used, depending on the attributes of the operands, according to the
following symmetric matrix:

@c thanks to Karl Berry, kb@cs.umb.edu, for major help with TeX tables
@tex
\centerline{
\vbox{\bigskip % space above the table (about 1 linespace)
% Because we have vertical rules, we can't let TeX insert interline space
% in its usual way.
\offinterlineskip
%
% Define the table template. & separates columns, and \cr ends the
% template (and each row). # is replaced by the text of that entry on
% each row. The template for the first column breaks down like this:
%   \strut -- a way to make each line have the height and depth
%             of a normal line of type, since we turned off interline spacing.
%   \hfil -- infinite glue; has the effect of right-justifying in this case.
%   # -- replaced by the text (for instance, `STRNUM', in the last row).
%   \quad -- about the width of an `M'. Just separates the columns.
%
% The second column (\vrule#) is what generates the vertical rule that
% spans table rows.
%
% The doubled && before the next entry means `repeat the following
% template as many times as necessary on each line' -- in our case, twice.
%
% The template itself, \quad#\hfil, left-justifies with a little space before.
%
\halign{\strut\hfil#\quad&\vrule#&&\quad#\hfil\cr
        &&STRING        &NUMERIC        &STRNUM\cr
% The \omit tells TeX to skip inserting the template for this column on
% this particular row. In this case, we only want a little extra space
% to separate the heading row from the rule below it. the depth 2pt --
% `\vrule depth 2pt' is that little space.
\omit   &depth 2pt\cr
% This is the horizontal rule below the heading. Since it has nothing to
% do with the columns of the table, we use \noalign to get it in there.
\noalign{\hrule}
% Like above, this time a little more space.
\omit   &depth 4pt\cr
% The remaining rows have nothing special about them.
STRING  &&string        &string         &string\cr
NUMERIC &&string        &numeric        &numeric\cr
STRNUM  &&string        &numeric        &numeric\cr
}}}
@end tex
@ifinfo
@display
        +----------------------------------------------
        |       STRING          NUMERIC         STRNUM
--------+----------------------------------------------
        |
STRING  |       string          string          string
        |
NUMERIC |       string          numeric         numeric
        |
STRNUM  |       string          numeric         numeric
--------+----------------------------------------------
@end display
@end ifinfo

The basic idea is that user input that looks numeric, and @emph{only}
user input, should be treated as numeric, even though it is actually
made of characters, and is therefore also a string.

@dfn{Comparison expressions} compare strings or numbers for
relationships such as equality. They are written using @dfn{relational
operators}, which are a superset of those in C. Here is a table of
them:

@cindex relational operators
@cindex operators, relational
@cindex @code{<} operator
@cindex @code{<=} operator
@cindex @code{>} operator
@cindex @code{>=} operator
@cindex @code{==} operator
@cindex @code{!=} operator
@cindex @code{~} operator
@cindex @code{!~} operator
@cindex @code{in} operator
@c @cartouche
@table @code
@item @var{x} < @var{y}
True if @var{x} is less than @var{y}.

@item @var{x} <= @var{y}
True if @var{x} is less than or equal to @var{y}.

@item @var{x} > @var{y}
True if @var{x} is greater than @var{y}.

@item @var{x} >= @var{y}
True if @var{x} is greater than or equal to @var{y}.

@item @var{x} == @var{y}
True if @var{x} is equal to @var{y}.

@item @var{x} != @var{y}
True if @var{x} is not equal to @var{y}.

@item @var{x} ~ @var{y}
True if the string @var{x} matches the regexp denoted by @var{y}.

@item @var{x} !~ @var{y}
True if the string @var{x} does not match the regexp denoted by @var{y}.

@item @var{subscript} in @var{array}
True if the array @var{array} has an element with the subscript @var{subscript}.
@end table
@c @end cartouche

Comparison expressions have the value one if true and zero if false.

When comparing operands of mixed types, numeric operands are converted
to strings using the value of @code{CONVFMT}
(@pxref{Conversion, ,Conversion of Strings and Numbers}).

Strings are compared
by comparing the first character of each, then the second character of each,
and so on. Thus @code{"10"} is less than @code{"9"}. If there are two
strings where one is a prefix of the other, the shorter string is less than
the longer one. Thus @code{"abc"} is less than @code{"abcd"}.

@cindex common mistakes
@cindex mistakes, common
@cindex errors, common
It is very easy to accidentally mistype the @samp{==} operator, and
leave off one of the @samp{=} characters. The result is still valid @code{awk}
code, but the program will not do what you mean:

@example
if (a = b)   # oops! should be a == b
   @dots{}
else
   @dots{}
@end example

@noindent
Unless @code{b} happens to be zero or the null string, the @code{if}
part of the test will always succeed. Because the operators are
so similar, this kind of error is very difficult to spot when
scanning the source code.

Here are some sample expressions, how @code{gawk} compares them, and what
the result of the comparison is.

@table @code
@item 1.5 <= 2.0
numeric comparison (true)

@item "abc" >= "xyz"
string comparison (false)

@item 1.5 != " +2"
string comparison (true)

@item "1e2" < "3"
string comparison (true)

@item a = 2; b = "2"
@itemx a == b
string comparison (true)

@item a = 2; b = " +2"
@itemx a == b
string comparison (false)
@end table

In this example,

@example
@group
$ echo 1e2 3 | awk '@{ print ($1 < $2) ? "true" : "false" @}'
@print{} false
@end group
@end example

@noindent
the result is @samp{false} since both @code{$1} and @code{$2} are numeric
strings and thus both have the @var{strnum} attribute,
dictating a numeric comparison.

The purpose of the comparison rules and the use of numeric strings is
to attempt to produce the behavior that is ``least surprising,'' while
still ``doing the right thing.''

@cindex comparisons, string vs. regexp
@cindex string comparison vs. regexp comparison
@cindex regexp comparison vs. string comparison
String comparisons and regular expression comparisons are very different.
For example,

@example
x == "foo"
@end example

@noindent
has the value of one, or is true, if the variable @code{x}
is precisely @samp{foo}. By contrast,

@example
x ~ /foo/
@end example

@noindent
has the value one if @code{x} contains @samp{foo}, such as
@code{"Oh, what a fool am I!"}.

The right hand operand of the @samp{~} and @samp{!~} operators may be
either a regexp constant (@code{/@dots{}/}), or an ordinary
expression, in which case the value of the expression as a string is used as a
dynamic regexp (@pxref{Regexp Usage, ,How to Use Regular Expressions}; also
@pxref{Computed Regexps, ,Using Dynamic Regexps}).
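
The following sketch puts the three kinds of test side by side.
It is only an illustration: the variable names @code{pat}, @code{exact},
@code{constant} and @code{dynamic} are ours, chosen for this example.

@example
@group
BEGIN @{
    x = "Oh, what a fool am I!"
    pat = "fo+l"                # a string used as a dynamic regexp

    exact    = (x == "foo")     # string comparison: x must be exactly "foo"
    constant = (x ~ /foo/)      # regexp constant: "foo" anywhere in x
    dynamic  = (x ~ pat)        # dynamic regexp taken from the value of pat

    print exact, constant, dynamic
@}
@end group
@end example

@noindent
This program should print @samp{0 1 1}: the string comparison fails because
@code{x} is not precisely @samp{foo}, while both regexp matches succeed
because @samp{fool} appears inside @code{x}.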

@cindex regexp as expression
In recent implementations of @code{awk}, a constant regular
expression in slashes by itself is also an expression. The regexp
@code{/@var{regexp}/} is an abbreviation for this comparison expression:

@example
$0 ~ /@var{regexp}/
@end example

One special place where @code{/foo/} is @emph{not} an abbreviation for
@samp{$0 ~ /foo/} is when it is the right-hand operand of @samp{~} or
@samp{!~}!
@xref{Using Constant Regexps, ,Using Regular Expression Constants},
where this is discussed in more detail.

@c This paragraph has been here since day 1, and has always bothered
@c me, especially since the expression doesn't really make a lot of
@c sense. So, just take it out.
@ignore
In some contexts it may be necessary to write parentheses around the
regexp to avoid confusing the @code{gawk} parser. For example,
@samp{(/x/ - /y/) > threshold} is not allowed, but @samp{((/x/) - (/y/))
> threshold} parses properly.
@end ignore

@node Boolean Ops, Conditional Exp, Typing and Comparison, Expressions
@section Boolean Expressions
@cindex expression, boolean
@cindex boolean expressions
@cindex operators, boolean
@cindex boolean operators
@cindex logical operations
@cindex operations, logical
@cindex short-circuit operators
@cindex operators, short-circuit
@cindex and operator
@cindex or operator
@cindex not operator
@cindex @code{&&} operator
@cindex @code{||} operator
@cindex @code{!} operator

A @dfn{boolean expression} is a combination of comparison expressions or
matching expressions, using the boolean operators ``or''
(@samp{||}), ``and'' (@samp{&&}), and ``not'' (@samp{!}), along with
parentheses to control nesting. The truth value of the boolean expression is
computed by combining the truth values of the component expressions.
Boolean expressions are also referred to as @dfn{logical expressions}.
The terms are equivalent.

Boolean expressions can be used wherever comparison and matching
expressions can be used. They can be used in @code{if}, @code{while},
@code{do} and @code{for} statements
(@pxref{Statements, ,Control Statements in Actions}).
They have numeric values (one if true, zero if false), which come into play
if the result of the boolean expression is stored in a variable, or
used in arithmetic.

In addition, every boolean expression is also a valid pattern, so
you can use one as a pattern to control the execution of rules.

Here are descriptions of the three boolean operators, with examples.

@c @cartouche
@table @code
@item @var{boolean1} && @var{boolean2}
True if both @var{boolean1} and @var{boolean2} are true. For example,
the following statement prints the current input record if it contains
both @samp{2400} and @samp{foo}.

@example
if ($0 ~ /2400/ && $0 ~ /foo/) print
@end example

The subexpression @var{boolean2} is evaluated only if @var{boolean1}
is true. This can make a difference when @var{boolean2} contains
expressions that have side effects: in the case of @samp{$0 ~ /foo/ &&
($2 == bar++)}, the variable @code{bar} is not incremented if there is
no @samp{foo} in the record.

@item @var{boolean1} || @var{boolean2}
True if at least one of @var{boolean1} or @var{boolean2} is true.
For example, the following statement prints all records in the input
that contain @emph{either} @samp{2400} or
@samp{foo}, or both.

@example
if ($0 ~ /2400/ || $0 ~ /foo/) print
@end example

The subexpression @var{boolean2} is evaluated only if @var{boolean1}
is false. This can make a difference when @var{boolean2} contains
expressions that have side effects.

@item ! @var{boolean}
True if @var{boolean} is false. For example, the following program prints
all records in the input file @file{BBS-list} that do @emph{not} contain the
string @samp{foo}.

@c A better example would be `if (! (subscript in array)) ...' but we
@c haven't done anything with arrays or `in' yet. Sigh.
@example
awk '@{ if (! ($0 ~ /foo/)) print @}' BBS-list
@end example
@end table
@c @end cartouche

The @samp{&&} and @samp{||} operators are called @dfn{short-circuit}
operators because of the way they work. Evaluation of the full expression
is ``short-circuited'' if the result can be determined part way through
its evaluation.

@cindex line continuation
You can continue a statement that uses @samp{&&} or @samp{||} simply
by putting a newline after them. But you cannot put a newline in front
of either of these operators without using backslash continuation
(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).

The actual value of an expression using the @samp{!} operator will be
either one or zero, depending upon the truth value of the expression it
is applied to.

The @samp{!} operator is often useful for changing the sense of a flag
variable from false to true and back again. For example, the following
program is one way to print lines in between special bracketing lines:

@example
$1 == "START"   @{ interested = ! interested @}
interested == 1 @{ print @}
$1 == "END"     @{ interested = ! interested @}
@end example

@noindent
The variable @code{interested}, like all @code{awk} variables, starts
out initialized to zero, which is also false. When a line is seen whose
first field is @samp{START}, the value of @code{interested} is toggled
to true, using @samp{!}. The next rule prints lines as long as
@code{interested} is true.
When a line is seen whose first field is
@samp{END}, @code{interested} is toggled back to false.
@ignore
We should discuss using `next' in the two rules that toggle the
variable, to avoid printing the bracketing lines, but that's more
distraction than really needed.
@end ignore

@node Conditional Exp, Function Calls, Boolean Ops, Expressions
@section Conditional Expressions
@cindex conditional expression
@cindex expression, conditional

A @dfn{conditional expression} is a special kind of expression with
three operands. It allows you to use one expression's value to select
one of two other expressions.

The conditional expression is the same as in the C language:

@example
@var{selector} ? @var{if-true-exp} : @var{if-false-exp}
@end example

@noindent
There are three subexpressions. The first, @var{selector}, is always
computed first. If it is ``true'' (not zero and not null) then
@var{if-true-exp} is computed next and its value becomes the value of
the whole expression. Otherwise, @var{if-false-exp} is computed next
and its value becomes the value of the whole expression.

For example, this expression produces the absolute value of @code{x}:

@example
x > 0 ? x : -x
@end example

Each time the conditional expression is computed, exactly one of
@var{if-true-exp} and @var{if-false-exp} is used; the other is ignored.
This is important when the expressions have side effects. For example,
this conditional expression examines element @code{i} of either array
@code{a} or array @code{b}, and increments @code{i}.

@example
x == y ? a[i++] : b[i++]
@end example

@noindent
This is guaranteed to increment @code{i} exactly once, because each time
only one of the two increment expressions is executed,
and the other is not.
@xref{Arrays, ,Arrays in @code{awk}},
for more information about arrays.

@cindex differences between @code{gawk} and @code{awk}
@cindex line continuation
As a minor @code{gawk} extension,
you can continue a statement that uses @samp{?:} simply
by putting a newline after either character.
However, you cannot put a newline in front
of either character without using backslash continuation
(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).
If @samp{--posix} is specified
(@pxref{Options, , Command Line Options}), then this extension is disabled.

@node Function Calls, Precedence, Conditional Exp, Expressions
@section Function Calls
@cindex function call
@cindex calling a function

A @dfn{function} is a name for a particular calculation. Because it has
a name, you can ask for it by name at any point in the program. For
example, the function @code{sqrt} computes the square root of a number.

A fixed set of functions is @dfn{built-in}, which means they are
available in every @code{awk} program. The @code{sqrt} function is one
of these. @xref{Built-in, ,Built-in Functions}, for a list of built-in
functions and their descriptions. In addition, you can define your own
functions for use in your program.
@xref{User-defined, ,User-defined Functions}, for how to do this.

@cindex arguments in function call
The way to use a function is with a @dfn{function call} expression,
which consists of the function name followed immediately by a list of
@dfn{arguments} in parentheses. The arguments are expressions which
provide the raw materials for the function's calculations.
When there is more than one argument, they are separated by commas. If
there are no arguments, write just @samp{()} after the function name.
Here are some examples:

@example
sqrt(x^2 + y^2)        @i{one argument}
atan2(y, x)            @i{two arguments}
rand()                 @i{no arguments}
@end example

@strong{Do not put any space between the function name and the
open-parenthesis!} A user-defined function name looks just like the name of
a variable, and space would make the expression look like concatenation
of a variable with an expression inside parentheses. Space before the
parenthesis is harmless with built-in functions, but it is best to avoid
the habit altogether, so that you do not make this mistake with
user-defined functions.

Each function expects a particular number of arguments. For example, the
@code{sqrt} function must be called with a single argument, the number
to take the square root of:

@example
sqrt(@var{argument})
@end example

Some of the built-in functions allow you to omit the final argument.
If you do so, they use a reasonable default.
@xref{Built-in, ,Built-in Functions}, for full details. If arguments
are omitted in calls to user-defined functions, then those arguments are
treated as local variables, initialized to the empty string
(@pxref{User-defined, ,User-defined Functions}).

Like every other expression, the function call has a value, which is
computed by the function based on the arguments you give it. In this
example, the value of @samp{sqrt(@var{argument})} is the square root of
@var{argument}. A function can also have side effects, such as assigning
values to certain variables or doing I/O.
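
As a small sketch of these points, the following program defines a
hypothetical user-defined function, @code{hypot} (the name and its
parameters are ours; it is not a built-in), and calls it with fewer
arguments than it declares. The omitted argument @code{sum} then
behaves as a local variable.

@example
@group
# hypot is an illustrative user-defined function, not a built-in.
# Callers pass only a and b; sum is never passed, so it acts as
# a local variable that starts out empty.
function hypot(a, b,    sum)
@{
    sum = a^2 + b^2
    return sqrt(sum)
@}

BEGIN @{
    print hypot(3, 4)    # no space between the name and the parenthesis
@}
@end group
@end example

@noindent
This program prints @samp{5}. The value of the call @samp{hypot(3, 4)} is
whatever the function returns, and the assignment to @code{sum} inside the
function does not affect any variable outside it.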

Here is a command to read numbers, one number per line, and print the
square root of each one:

@example
@group
$ awk '@{ print "The square root of", $1, "is", sqrt($1) @}'
1
@print{} The square root of 1 is 1
3
@print{} The square root of 3 is 1.73205
5
@print{} The square root of 5 is 2.23607
@kbd{Control-d}
@end group
@end example

@node Precedence, , Function Calls, Expressions
@section Operator Precedence (How Operators Nest)
@cindex precedence
@cindex operator precedence

@dfn{Operator precedence} determines how operators are grouped, when
different operators appear close by in one expression. For example,
@samp{*} has higher precedence than @samp{+}; thus, @samp{a + b * c}
means to multiply @code{b} and @code{c}, and then add @code{a} to the
product (i.e.@: @samp{a + (b * c)}).

You can overrule the precedence of the operators by using parentheses.
You can think of the precedence rules as saying where the
parentheses are assumed to be if you do not write parentheses yourself. In
fact, it is wise to always use parentheses whenever you have an unusual
combination of operators, because other people who read the program may
not remember what the precedence is in this case. You might forget,
too; then you could make a mistake. Explicit parentheses will help prevent
any such mistake.

When operators of equal precedence are used together, the leftmost
operator groups first, except for the assignment, conditional and
exponentiation operators, which group in the opposite order.
Thus, @samp{a - b + c} groups as @samp{(a - b) + c}, and
@samp{a = b = c} groups as @samp{a = (b = c)}.

The precedence of prefix unary operators does not matter as long as only
unary operators are involved, because there is only one way to interpret
them---innermost first.
Thus, @samp{$++i} means @samp{$(++i)} and
@samp{++$x} means @samp{++($x)}. However, when another operator follows
the operand, then the precedence of the unary operators can matter.
Thus, @samp{$x^2} means @samp{($x)^2}, but @samp{-x^2} means
@samp{-(x^2)}, because @samp{-} has lower precedence than @samp{^}
while @samp{$} has higher precedence.

Here is a table of @code{awk}'s operators, in order from highest
precedence to lowest:

@c use @code in the items, looks better in TeX w/o all the quotes
@table @code
@item (@dots{})
Grouping.

@item $
Field.

@item ++ --
Increment, decrement.

@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
@item ^ **
Exponentiation. These operators group right-to-left.
(The @samp{**} operator is not specified by POSIX.)

@item + - !
Unary plus, minus, logical ``not''.

@item * / %
Multiplication, division, modulus.

@item + -
Addition, subtraction.

@item @r{Concatenation}
No special token is used to indicate concatenation.
The operands are simply written side by side.

@item < <= == !=
@itemx > >= >> |
Relational, and redirection.
The relational operators and the redirections have the same precedence
level. Characters such as @samp{>} serve both as relationals and as
redirections; the context distinguishes between the two meanings.

Note that the I/O redirection operators in @code{print} and @code{printf}
statements belong to the statement level, not to expressions. The
redirection does not produce an expression which could be the operand of
another operator. As a result, it does not make sense to use a
redirection operator near another operator of lower precedence, without
parentheses. Such combinations, for example @samp{print foo > a ? b : c},
result in syntax errors.
The correct way to write this statement is @samp{print foo > (a ? b : c)}.

@item ~ !~
Matching, non-matching.

@item in
Array membership.

@item &&
Logical ``and''.

@item ||
Logical ``or''.

@item ?:
Conditional. This operator groups right-to-left.

@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
@item = += -= *=
@itemx /= %= ^= **=
Assignment. These operators group right-to-left.
(The @samp{**=} operator is not specified by POSIX.)
@end table

@node Patterns and Actions, Statements, Expressions, Top
@chapter Patterns and Actions
@cindex pattern, definition of

As you have already seen, each @code{awk} rule consists of
a pattern with an associated action. This chapter describes how
you build patterns and actions.

@menu
* Pattern Overview::            What goes into a pattern.
* Action Overview::             What goes into an action.
@end menu

@node Pattern Overview, Action Overview, Patterns and Actions, Patterns and Actions
@section Pattern Elements

Patterns in @code{awk} control the execution of rules: a rule is
executed when its pattern matches the current input record. This
section explains all about how to write patterns.

@menu
* Kinds of Patterns::           A list of all kinds of patterns.
* Regexp Patterns::             Using regexps as patterns.
* Expression Patterns::         Any expression can be used as a pattern.
* Ranges::                      Pairs of patterns specify record ranges.
* BEGIN/END::                   Specifying initialization and cleanup rules.
* Empty::                       The empty pattern, which matches every record.
@end menu

@node Kinds of Patterns, Regexp Patterns, Pattern Overview, Pattern Overview
@subsection Kinds of Patterns
@cindex patterns, types of

Here is a summary of the types of patterns supported in @code{awk}.

@table @code
@item /@var{regular expression}/
A regular expression as a pattern. It matches when the text of the
input record fits the regular expression.
(@xref{Regexp, ,Regular Expressions}.)

@item @var{expression}
A single expression. It matches when its value
is non-zero (if a number) or non-null (if a string).
(@xref{Expression Patterns, ,Expressions as Patterns}.)

@item @var{pat1}, @var{pat2}
A pair of patterns separated by a comma, specifying a range of records.
The range includes both the initial record that matches @var{pat1}, and
the final record that matches @var{pat2}.
(@xref{Ranges, ,Specifying Record Ranges with Patterns}.)

@item BEGIN
@itemx END
Special patterns for you to supply start-up or clean-up actions for your
@code{awk} program.
(@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.)

@item @var{empty}
The empty pattern matches every input record.
(@xref{Empty, ,The Empty Pattern}.)
@end table

@node Regexp Patterns, Expression Patterns, Kinds of Patterns, Pattern Overview
@subsection Regular Expressions as Patterns

We have been using regular expressions as patterns since our early examples.
This kind of pattern is simply a regexp constant in the pattern part of
a rule. Its meaning is @samp{$0 ~ /@var{pattern}/}.
The pattern matches when the input record matches the regexp.
For example:

@example
/foo|bar|baz/  @{ buzzwords++ @}
END            @{ print buzzwords, "buzzwords seen" @}
@end example

@node Expression Patterns, Ranges, Regexp Patterns, Pattern Overview
@subsection Expressions as Patterns

Any @code{awk} expression is valid as an @code{awk} pattern.
The pattern matches if the expression's value is non-zero (if a
number) or non-null (if a string).
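
For instance, here is a minimal sketch that uses the built-in variable
@code{NF} (the number of fields in the current record) as the entire
pattern. The sample input, supplied here by the shell's @code{printf}
command, is ours and serves only as an illustration.

@example
@group
$ printf 'a\n\nb c\n' | awk 'NF'
@print{} a
@print{} b c
@end group
@end example

@noindent
The empty second line has no fields, so @code{NF} is zero for that record
and the pattern does not match. For the other two records @code{NF} is
non-zero, and since the rule has no action, the matching records are
simply printed.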

The expression is reevaluated each time the rule is tested against a new
input record. If the expression uses fields such as @code{$1}, the
value depends directly on the new input record's text; otherwise, it
depends only on what has happened so far in the execution of the
@code{awk} program, but that may still be useful.

A very common kind of expression used as a pattern is the comparison
expression, using the comparison operators described in
@ref{Typing and Comparison, ,Variable Typing and Comparison Expressions}.

Regexp matching and non-matching are also very common expressions.
The left operand of the @samp{~} and @samp{!~} operators is a string.
The right operand is either a constant regular expression enclosed in
slashes (@code{/@var{regexp}/}), or any expression, whose string value
is used as a dynamic regular expression
(@pxref{Computed Regexps, , Using Dynamic Regexps}).

The following example prints the second field of each input record
whose first field is precisely @samp{foo}.

@example
$ awk '$1 == "foo" @{ print $2 @}' BBS-list
@end example

@noindent
(There is no output, since there is no BBS site named ``foo''.)
Contrast this with the following regular expression match, which would
accept any record with a first field that contains @samp{foo}:

@example
@group
$ awk '$1 ~ /foo/ @{ print $2 @}' BBS-list
@print{} 555-1234
@print{} 555-6699
@print{} 555-6480
@print{} 555-2127
@end group
@end example

Boolean expressions are also commonly used as patterns.
Whether the pattern
matches an input record depends on whether its subexpressions match.

For example, the following command prints all records in
@file{BBS-list} that contain both @samp{2400} and @samp{foo}.

@example
$ awk '/2400/ && /foo/' BBS-list
@print{} fooey        555-1234     2400/1200/300     B
@end example

The following command prints all records in
@file{BBS-list} that contain @emph{either} @samp{2400} or @samp{foo}, or
both.

@example
@group
$ awk '/2400/ || /foo/' BBS-list
@print{} alpo-net     555-3412     2400/1200/300     A
@print{} bites        555-1675     2400/1200/300     A
@print{} fooey        555-1234     2400/1200/300     B
@print{} foot         555-6699     1200/300          B
@print{} macfoo       555-6480     1200/300          A
@print{} sdace        555-3430     2400/1200/300     A
@print{} sabafoo      555-2127     1200/300          C
@end group
@end example

The following command prints all records in
@file{BBS-list} that do @emph{not} contain the string @samp{foo}.

@example
@group
$ awk '! /foo/' BBS-list
@print{} aardvark     555-5553     1200/300          B
@print{} alpo-net     555-3412     2400/1200/300     A
@print{} barfly       555-7685     1200/300          A
@print{} bites        555-1675     2400/1200/300     A
@print{} camelot      555-0542     300               C
@print{} core         555-2912     1200/300          C
@print{} sdace        555-3430     2400/1200/300     A
@end group
@end example

The subexpressions of a boolean operator in a pattern can be constant regular
expressions, comparisons, or any other @code{awk} expressions. Range
patterns are not expressions, so they cannot appear inside boolean
patterns. Likewise, the special patterns @code{BEGIN} and @code{END},
which never match any input record, are not expressions and cannot
appear inside boolean patterns.

A regexp constant as a pattern is also a special case of an expression
pattern. @code{/foo/} as an expression has the value one if @samp{foo}
appears in the current input record; thus, as a pattern, @code{/foo/}
matches any record containing @samp{foo}.
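
To see the value of a regexp constant directly, one can store it in a
variable and print it. The following is only a sketch: the variable
name @code{found} and the two-line sample input are invented for the
illustration.

@example
$ printf 'food\nbar\n' | awk '@{ found = /foo/; print found, $0 @}'
@print{} 1 food
@print{} 0 bar
@end example

@noindent
The first record contains @samp{foo}, so @code{/foo/} evaluates to one;
the second does not, so it evaluates to zero.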

@node Ranges, BEGIN/END, Expression Patterns, Pattern Overview
@subsection Specifying Record Ranges with Patterns

@cindex range pattern
@cindex pattern, range
@cindex matching ranges of lines
A @dfn{range pattern} is made of two patterns separated by a comma, of
the form @samp{@var{begpat}, @var{endpat}}. It matches ranges of
consecutive input records. The first pattern, @var{begpat}, controls
where the range begins, and the second one, @var{endpat}, controls where
it ends. For example,

@example
awk '$1 == "on", $1 == "off"'
@end example

@noindent
prints every record between @samp{on}/@samp{off} pairs, inclusive.

A range pattern starts out by matching @var{begpat}
against every input record; when a record matches @var{begpat}, the
range pattern becomes @dfn{turned on}. The range pattern matches this
record. As long as it stays turned on, it automatically matches every
input record read. It also matches @var{endpat} against
every input record; when that succeeds, the range pattern is turned
off again for the following record. Then it goes back to checking
@var{begpat} against each record.

The record that turns on the range pattern and the one that turns it
off both match the range pattern. If you don't want to operate on
these records, you can write @code{if} statements in the rule's action
to distinguish them from the records you are interested in.

It is possible for a pattern to be turned both on and off by the same
record, if the record satisfies both conditions. Then the action is
executed for just that record.

For example, suppose you have text between two identical markers (say
the @samp{%} symbol) that you wish to ignore.
You might try to
combine a range pattern that describes the delimited text with the
@code{next} statement
(not discussed yet, @pxref{Next Statement, , The @code{next} Statement}),
which causes @code{awk} to skip any further processing of the current
record and start over again with the next input record. Such a program
would look like this:

@example
/^%$/,/^%$/    @{ next @}
               @{ print @}
@end example

@noindent
@cindex skipping lines between markers
This program fails because the range pattern is both turned on and turned off
by the first line with just a @samp{%} on it. To accomplish this task, you
must write the program this way, using a flag:

@example
/^%$/     @{ skip = ! skip; next @}
skip == 1 @{ next @} # skip lines with `skip' set
@end example

Note that in a range pattern, the @samp{,} has the lowest precedence
(is evaluated last) of all the operators. Thus, for example, the
following program attempts to combine a range pattern with another,
simpler test.

@example
echo Yes | awk '/1/,/2/ || /Yes/'
@end example

The author of this program intended it to mean @samp{(/1/,/2/) || /Yes/}.
However, @code{awk} interprets this as @samp{/1/, (/2/ || /Yes/)}.
This cannot be changed or worked around; range patterns do not combine
with other patterns.

@node BEGIN/END, Empty, Ranges, Pattern Overview
@subsection The @code{BEGIN} and @code{END} Special Patterns

@cindex @code{BEGIN} special pattern
@cindex pattern, @code{BEGIN}
@cindex @code{END} special pattern
@cindex pattern, @code{END}
@code{BEGIN} and @code{END} are special patterns. They are not used to
match input records. Rather, they supply start-up or
clean-up actions for your @code{awk} script.

@menu
* Using BEGIN/END::             How and why to use BEGIN/END rules.
* I/O And BEGIN/END::           I/O issues in BEGIN/END rules.
@end menu

@node Using BEGIN/END, I/O And BEGIN/END, BEGIN/END, BEGIN/END
@subsubsection Startup and Cleanup Actions

A @code{BEGIN} rule is executed, once, before the first input record
has been read. An @code{END} rule is executed, once, after all the
input has been read. For example:

@example
@group
$ awk '
> BEGIN @{ print "Analysis of \"foo\"" @}
> /foo/ @{ ++n @}
> END   @{ print "\"foo\" appears " n " times." @}' BBS-list
@print{} Analysis of "foo"
@print{} "foo" appears 4 times.
@end group
@end example

This program finds the number of records in the input file @file{BBS-list}
that contain the string @samp{foo}. The @code{BEGIN} rule prints a title
for the report. There is no need to use the @code{BEGIN} rule to
initialize the counter @code{n} to zero, as @code{awk} does this
automatically (@pxref{Variables}).

The second rule increments the variable @code{n} every time a
record containing the pattern @samp{foo} is read. The @code{END} rule
prints the value of @code{n} at the end of the run.

The special patterns @code{BEGIN} and @code{END} cannot be used in ranges
or with boolean operators (indeed, they cannot be used with any operators).

An @code{awk} program may have multiple @code{BEGIN} and/or @code{END}
rules. They are executed in the order they appear, all the @code{BEGIN}
rules at start-up and all the @code{END} rules at termination.
@code{BEGIN} and @code{END} rules may be intermixed with other rules.
This feature was added in the 1987 version of @code{awk}, and is included
in the POSIX standard. The original (1978) version of @code{awk}
required you to put the @code{BEGIN} rule at the beginning of the
program, and the @code{END} rule at the end, and only allowed one of
each. This is no longer required, but it is a good idea in terms of
program organization and readability.
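
The ordering described above can be checked with a short experiment.
This is an illustrative sketch only; the messages and the one-line
input are made up. The rules are deliberately intermixed, yet both
@code{BEGIN} rules run before the record is processed and both
@code{END} rules run afterwards, each group in order of appearance.

@example
@group
$ echo data | awk '
> BEGIN @{ print "first BEGIN" @}
> END   @{ print "first END" @}
> BEGIN @{ print "second BEGIN" @}
>       @{ print "record:", $0 @}
> END   @{ print "second END" @}'
@print{} first BEGIN
@print{} second BEGIN
@print{} record: data
@print{} first END
@print{} second END
@end group
@end example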

Multiple @code{BEGIN} and @code{END} rules are useful for writing
library functions, since each library file can have its own @code{BEGIN} and/or
@code{END} rule to do its own initialization and/or cleanup. Note that
the order in which library functions are named on the command line
controls the order in which their @code{BEGIN} and @code{END} rules are
executed. Therefore you have to be careful to write such rules in
library files so that the order in which they are executed doesn't matter.
@xref{Options, ,Command Line Options}, for more information on
using library functions.
@xref{Library Functions, ,A Library of @code{awk} Functions},
for a number of useful library functions.

@cindex dark corner
If an @code{awk} program only has a @code{BEGIN} rule, and no other
rules, then the program exits after the @code{BEGIN} rule has been run.
(The original version of @code{awk} used to keep reading and ignoring input
until end of file was seen.) However, if an @code{END} rule exists,
then the input will be read, even if there are no other rules in
the program. This is necessary in case the @code{END} rule checks the
@code{FNR} and @code{NR} variables (d.c.).

@code{BEGIN} and @code{END} rules must have actions; there is no default
action for these rules since there is no current record when they run.

@node I/O And BEGIN/END, , Using BEGIN/END, BEGIN/END
@subsubsection Input/Output from @code{BEGIN} and @code{END} Rules

@cindex I/O from @code{BEGIN} and @code{END}
There are several (sometimes subtle) issues involved when doing I/O
from a @code{BEGIN} or @code{END} rule.

The first has to do with the value of @code{$0} in a @code{BEGIN}
rule. Since @code{BEGIN} rules are executed before any input is read,
there simply is no input record, and therefore no fields, when
executing @code{BEGIN} rules.
References to @code{$0} and the fields
yield a null string or zero, depending upon the context. One way
to give @code{$0} a real value is to execute a @code{getline} command
without a variable (@pxref{Getline, ,Explicit Input with @code{getline}}).
Another way is to simply assign a value to it.

@cindex differences between @code{gawk} and @code{awk}
The second point is similar to the first, but from the other direction.
Inside an @code{END} rule, what is the value of @code{$0} and @code{NF}?
Traditionally, due largely to implementation issues, @code{$0} and
@code{NF} were @emph{undefined} inside an @code{END} rule.
The POSIX standard specified that @code{NF} was available in an @code{END}
rule, containing the number of fields from the last input record.
Due most probably to an oversight, the standard does not say that @code{$0}
is also preserved, although logically one would think that it should be.
In fact, @code{gawk} does preserve the value of @code{$0} for use in
@code{END} rules. Be aware, however, that Unix @code{awk}, and possibly
other implementations, do not.

The third point follows from the first two. What is the meaning of
@samp{print} inside a @code{BEGIN} or @code{END} rule? The meaning is
the same as always, @samp{print $0}. If @code{$0} is the null string,
then this prints an empty line. Many long-time @code{awk} programmers
use @samp{print} in @code{BEGIN} and @code{END} rules, to mean
@samp{@w{print ""}}, relying on @code{$0} being null. While you might
generally get away with this in @code{BEGIN} rules, in @code{gawk} at
least, it is a very bad idea in @code{END} rules. It is also poor
style, since if you want an empty line in the output, you
should say so explicitly in your program.
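
The following sketch illustrates the first two points; the sample
input and the variable name @code{last} are invented for the
illustration. The first command uses a plain @code{getline} to give
@code{$0} (and therefore @code{NF}) a value inside a @code{BEGIN} rule.
The second avoids depending on @code{$0} in the @code{END} rule at all,
by saving each record in a variable as it is read, which should behave
the same way in implementations that do not preserve @code{$0}.

@example
$ echo 'one two' | awk 'BEGIN @{ getline; print "BEGIN sees:", $0, "NF =", NF @}'
@print{} BEGIN sees: one two NF = 2
$ printf 'a\nb\n' | awk '@{ last = $0 @} END @{ print "last record:", last @}'
@print{} last record: b
@end example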

@node Empty, , BEGIN/END, Pattern Overview
@subsection The Empty Pattern

@cindex empty pattern
@cindex pattern, empty
An empty (i.e.@: non-existent) pattern is considered to match @emph{every}
input record. For example, the program:

@example
awk '@{ print $1 @}' BBS-list
@end example

@noindent
prints the first field of every record.

@node Action Overview, , Pattern Overview, Patterns and Actions
@section Overview of Actions
@cindex action, definition of
@cindex curly braces
@cindex action, curly braces
@cindex action, separating statements

An @code{awk} program or script consists of a series of
rules and function definitions, interspersed. (Functions are
described later. @xref{User-defined, ,User-defined Functions}.)

A rule contains a pattern and an action, either of which (but not
both) may be
omitted. The purpose of the @dfn{action} is to tell @code{awk} what to do
once a match for the pattern is found. Thus, in outline, an @code{awk}
program generally looks like this:

@example
@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]}
@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]}
@dots{}
function @var{name}(@var{args}) @{ @dots{} @}
@dots{}
@end example

An action consists of one or more @code{awk} @dfn{statements}, enclosed
in curly braces (@samp{@{} and @samp{@}}). Each statement specifies one
thing to be done. The statements are separated by newlines or
semicolons.

The curly braces around an action must be used even if the action
contains only one statement, or even if it contains no statements at
all. However, if you omit the action entirely, omit the curly braces as
well. An omitted action is equivalent to @samp{@{ print $0 @}}.

@example
/foo/  @{ @}  # match foo, do nothing - empty action
/foo/       # match foo, print the record - omitted action
@end example

Here are the kinds of statements supported in @code{awk}:

@itemize @bullet
@item
Expressions, which can call functions or assign values to variables
(@pxref{Expressions}). Executing
this kind of statement simply computes the value of the expression.
This is useful when the expression has side effects
(@pxref{Assignment Ops, ,Assignment Expressions}).

@item
Control statements, which specify the control flow of @code{awk}
programs. The @code{awk} language gives you C-like constructs
(@code{if}, @code{for}, @code{while}, and @code{do}) as well as a few
special ones (@pxref{Statements, ,Control Statements in Actions}).

@item
Compound statements, which consist of one or more statements enclosed in
curly braces. A compound statement is used in order to put several
statements together in the body of an @code{if}, @code{while}, @code{do}
or @code{for} statement.

@item
Input statements, using the @code{getline} command
(@pxref{Getline, ,Explicit Input with @code{getline}}), the @code{next}
statement (@pxref{Next Statement, ,The @code{next} Statement}),
and the @code{nextfile} statement
(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).

@item
Output statements, @code{print} and @code{printf}.
@xref{Printing, ,Printing Output}.

@item
Deletion statements, for deleting array elements.
@xref{Delete, ,The @code{delete} Statement}.
@end itemize

@iftex
The next chapter covers control statements in detail.
@end iftex

@node Statements, Built-in Variables, Patterns and Actions, Top
@chapter Control Statements in Actions
@cindex control statement

@dfn{Control statements} such as @code{if}, @code{while}, and so on
control the flow of execution in @code{awk} programs. Most of the
control statements in @code{awk} are patterned on similar statements in
C.

All the control statements start with special keywords such as @code{if}
and @code{while}, to distinguish them from simple expressions.

@cindex compound statement
@cindex statement, compound
Many control statements contain other statements; for example, the
@code{if} statement contains another statement which may or may not be
executed. The contained statement is called the @dfn{body}. If you
want to include more than one statement in the body, group them into a
single @dfn{compound statement} with curly braces, separating them with
newlines or semicolons.

@menu
* If Statement::                Conditionally execute some @code{awk}
                                statements.
* While Statement::             Loop until some condition is satisfied.
* Do Statement::                Do specified action while looping until some
                                condition is satisfied.
* For Statement::               Another looping statement, that provides
                                initialization and increment clauses.
* Break Statement::             Immediately exit the innermost enclosing loop.
* Continue Statement::          Skip to the end of the innermost enclosing
                                loop.
* Next Statement::              Stop processing the current input record.
* Nextfile Statement::          Stop processing the current file.
* Exit Statement::              Stop execution of @code{awk}.
@end menu

@node If Statement, While Statement, Statements, Statements
@section The @code{if}-@code{else} Statement

@cindex @code{if}-@code{else} statement
The @code{if}-@code{else} statement is @code{awk}'s decision-making
statement.
It looks like this:

@example
if (@var{condition}) @var{then-body} @r{[}else @var{else-body}@r{]}
@end example

@noindent
The @var{condition} is an expression that controls what the rest of the
statement will do. If @var{condition} is true, @var{then-body} is
executed; otherwise, @var{else-body} is executed.
The @code{else} part of the statement is
optional. The condition is considered false if its value is zero or
the null string, and true otherwise.

Here is an example:

@example
if (x % 2 == 0)
    print "x is even"
else
    print "x is odd"
@end example

In this example, if the expression @samp{x % 2 == 0} is true (that is,
the value of @code{x} is evenly divisible by two), then the first @code{print}
statement is executed, otherwise the second @code{print} statement is
executed.

If the @code{else} appears on the same line as @var{then-body}, and
@var{then-body} is not a compound statement (i.e.@: not surrounded by
curly braces), then a semicolon must separate @var{then-body} from
@code{else}. To illustrate this, let's rewrite the previous example:

@example
if (x % 2 == 0) print "x is even"; else
        print "x is odd"
@end example

@noindent
If you forget the @samp{;}, @code{awk} won't be able to interpret the
statement, and you will get a syntax error.

We would not actually write this example this way, because a human
reader might fail to see the @code{else} if it were not the first thing
on its line.

@node While Statement, Do Statement, If Statement, Statements
@section The @code{while} Statement
@cindex @code{while} statement
@cindex loop
@cindex body of a loop

In programming, a @dfn{loop} means a part of a program that can
be executed two or more times in succession.

The @code{while} statement is the simplest looping statement in
@code{awk}.
It repeatedly executes a statement as long as a condition is
true. It looks like this:

@example
while (@var{condition})
  @var{body}
@end example

@noindent
Here @var{body} is a statement that we call the @dfn{body} of the loop,
and @var{condition} is an expression that controls how long the loop
keeps running.

The first thing the @code{while} statement does is test @var{condition}.
If @var{condition} is true, it executes the statement @var{body}.
@ifinfo
(The @var{condition} is true when the value
is not zero and not a null string.)
@end ifinfo
After @var{body} has been executed,
@var{condition} is tested again, and if it is still true, @var{body} is
executed again. This process repeats until @var{condition} is no longer
true. If @var{condition} is initially false, the body of the loop is
never executed, and @code{awk} continues with the statement following
the loop.

This example prints the first three fields of each record, one per line.

@example
awk '@{ i = 1
       while (i <= 3) @{
           print $i
           i++
       @}
@}' inventory-shipped
@end example

@noindent
Here the body of the loop is a compound statement enclosed in braces,
containing two statements.

The loop works like this: first, the value of @code{i} is set to one.
Then, the @code{while} tests whether @code{i} is less than or equal to
three. This is true when @code{i} equals one, so the @code{i}-th
field is printed. Then the @samp{i++} increments the value of @code{i}
and the loop repeats. The loop terminates when @code{i} reaches four.

As you can see, a newline is not required between the condition and the
body; but using one makes the program clearer unless the body is a
compound statement or is very simple.
The newline after the open-brace
that begins the compound statement is not required either, but the
program would be harder to read without it.

@node Do Statement, For Statement, While Statement, Statements
@section The @code{do}-@code{while} Statement

The @code{do} loop is a variation of the @code{while} looping statement.
The @code{do} loop executes the @var{body} once, and then repeats @var{body}
as long as @var{condition} is true. It looks like this:

@example
@group
do
  @var{body}
while (@var{condition})
@end group
@end example

Even if @var{condition} is false at the start, @var{body} is executed at
least once (and only once, unless executing @var{body} makes
@var{condition} true). Contrast this with the corresponding
@code{while} statement:

@example
while (@var{condition})
  @var{body}
@end example

@noindent
This statement does not execute @var{body} even once if @var{condition}
is false to begin with.

Here is an example of a @code{do} statement:

@example
awk '@{ i = 1
       do @{
          print $0
          i++
       @} while (i <= 10)
@}'
@end example

@noindent
This program prints each input record ten times. It isn't a very
realistic example, since in this case an ordinary @code{while} would do
just as well. But this reflects actual experience; there is only
occasionally a real use for a @code{do} statement.

@node For Statement, Break Statement, Do Statement, Statements
@section The @code{for} Statement
@cindex @code{for} statement

The @code{for} statement makes it more convenient to count iterations of a
loop.
The general form of the @code{for} statement looks like this:

@example
for (@var{initialization}; @var{condition}; @var{increment})
  @var{body}
@end example

@noindent
The @var{initialization}, @var{condition} and @var{increment} parts are
arbitrary @code{awk} expressions, and @var{body} stands for any
@code{awk} statement.

The @code{for} statement starts by executing @var{initialization}.
Then, as long
as @var{condition} is true, it repeatedly executes @var{body} and then
@var{increment}. Typically @var{initialization} sets a variable to
either zero or one, @var{increment} adds one to it, and @var{condition}
compares it against the desired number of iterations.

Here is an example of a @code{for} statement:

@example
@group
awk '@{ for (i = 1; i <= 3; i++)
          print $i
@}' inventory-shipped
@end group
@end example

@noindent
This prints the first three fields of each input record, one field per
line.

You cannot set more than one variable in the
@var{initialization} part unless you use a multiple assignment statement
such as @samp{x = y = 0}, which is possible only if all the initial values
are equal. (But you can initialize additional variables by writing
their assignments as separate statements preceding the @code{for} loop.)

The same is true of the @var{increment} part; to increment additional
variables, you must write separate statements at the end of the loop.
The C compound expression, using C's comma operator, would be useful in
this context, but it is not supported in @code{awk}.

Most often, @var{increment} is an increment expression, as in the
example above. But this is not required; it can be any expression
whatever.
For example, this statement prints all the powers of two
between one and 100:

@example
for (i = 1; i <= 100; i *= 2)
  print i
@end example

Any of the three expressions in the parentheses following the @code{for} may
be omitted if there is nothing to be done there. Thus,
@w{@samp{for (; x > 0;)}} is equivalent to @w{@samp{while (x > 0)}}. If the
@var{condition} is omitted, it is treated as true, effectively
yielding an @dfn{infinite loop} (i.e.@: a loop that will never
terminate).

In most cases, a @code{for} loop is an abbreviation for a @code{while}
loop, as shown here:

@example
@var{initialization}
while (@var{condition}) @{
  @var{body}
  @var{increment}
@}
@end example

@noindent
The only exception is when the @code{continue} statement
(@pxref{Continue Statement, ,The @code{continue} Statement}) is used
inside the loop; changing a @code{for} statement to a @code{while}
statement in this way can change the effect of the @code{continue}
statement inside the loop.

There is an alternate version of the @code{for} loop, for iterating over
all the indices of an array:

@example
for (i in array)
    @var{do something with} array[i]
@end example

@noindent
@xref{Scanning an Array, ,Scanning All Elements of an Array},
for more information on this version of the @code{for} loop.

The @code{awk} language has a @code{for} statement in addition to a
@code{while} statement because often a @code{for} loop is both less work to
type and more natural to think of. Counting the number of iterations is
very common in loops. It can be easier to think of this counting as part
of looping rather than as something to do inside the loop.

The next section has more complicated examples of @code{for} loops.
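
As a brief sketch of the workaround described earlier for loops that
need two variables, the following program counts @code{i} up while
counting @code{j} down. The variable names and values are chosen only
for illustration. Since @code{awk} has no comma operator, @code{j} is
initialized in a separate statement before the loop and updated in a
separate statement at the end of the body.

@example
@group
$ awk 'BEGIN @{
>     j = 10                        # second variable, set before the loop
>     for (i = 1; i <= 3; i++) @{
>         print i, j
>         j--                       # second variable, updated in the body
>     @}
> @}'
@print{} 1 10
@print{} 2 9
@print{} 3 8
@end group
@end example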

@node Break Statement, Continue Statement, For Statement, Statements
@section The @code{break} Statement
@cindex @code{break} statement
@cindex loops, exiting

The @code{break} statement jumps out of the innermost @code{for},
@code{while}, or @code{do} loop that encloses it. The
following example finds the smallest divisor of any integer, and also
identifies prime numbers:

@example
awk '# find smallest divisor of num
     @{ num = $1
@group
       for (div = 2; div*div <= num; div++)
         if (num % div == 0)
           break
@end group
       if (num % div == 0)
         printf "Smallest divisor of %d is %d\n", num, div
       else
         printf "%d is prime\n", num
     @}'
@end example

When the remainder is zero in the first @code{if} statement, @code{awk}
immediately @dfn{breaks out} of the containing @code{for} loop. This means
that @code{awk} proceeds immediately to the statement following the loop
and continues processing. (This is very different from the @code{exit}
statement which stops the entire @code{awk} program.
@xref{Exit Statement, ,The @code{exit} Statement}.)

Here is another program equivalent to the previous one.
It illustrates how
the @var{condition} of a @code{for} or @code{while} could just as well be
replaced with a @code{break} inside an @code{if}:

@example
@group
awk '# find smallest divisor of num
     @{ num = $1
       for (div = 2; ; div++) @{
         if (num % div == 0) @{
           printf "Smallest divisor of %d is %d\n", num, div
           break
         @}
         if (div*div > num) @{
           printf "%d is prime\n", num
           break
         @}
       @}
@}'
@end group
@end example

@cindex @code{break}, outside of loops
@cindex historical features
@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
@cindex dark corner
As described above, the @code{break} statement has no meaning when
used outside the body of a loop. However, although it was never documented,
historical implementations of @code{awk} have treated the @code{break}
statement outside of a loop as if it were a @code{next} statement
(@pxref{Next Statement, ,The @code{next} Statement}).
Recent versions of Unix @code{awk} no longer allow this usage.
@code{gawk} will support this use of @code{break} only if @samp{--traditional}
has been specified on the command line
(@pxref{Options, ,Command Line Options}).
Otherwise, it will be treated as an error, since the POSIX standard
specifies that @code{break} should only be used inside the body of a
loop (d.c.).

@node Continue Statement, Next Statement, Break Statement, Statements
@section The @code{continue} Statement

@cindex @code{continue} statement
The @code{continue} statement, like @code{break}, is used only inside
@code{for}, @code{while}, and @code{do} loops. It skips
over the rest of the loop body, causing the next cycle around the loop
to begin immediately. Contrast this with @code{break}, which jumps out
of the loop altogether.

@c The point of this program was to illustrate the use of continue with
@c a while loop. But Karl Berry points out that that is done adequately
@c below, and that this example is very un-awk-like. So for now, we'll
@c omit it.
@ignore
In Texinfo source files, text that the author wishes to ignore can be
enclosed between lines that start with @samp{@@ignore} and end with
@samp{@atend ignore}. Here is a program that strips out lines between
@samp{@@ignore} and @samp{@atend ignore} pairs.

@example
BEGIN @{
    while (getline > 0) @{
        if (/^@@ignore/)
            ignoring = 1
        else if (/^@@end[ \t]+ignore/) @{
            ignoring = 0
            continue
        @}
        if (ignoring)
            continue
        print
    @}
@}
@end example

When an @samp{@@ignore} is seen, the @code{ignoring} flag is set to one (true).
When @samp{@atend ignore} is seen, the flag is reset to zero (false). As long
as the flag is true, the input record is not printed, because the
@code{continue} restarts the @code{while} loop, skipping over the @code{print}
statement.

@c Exercise!!!
@c How could this program be written to make better use of the awk language?
@end ignore

The @code{continue} statement in a @code{for} loop directs @code{awk} to
skip the rest of the body of the loop, and resume execution with the
increment-expression of the @code{for} statement. The following program
illustrates this fact:

@example
awk 'BEGIN @{
     for (x = 0; x <= 20; x++) @{
         if (x == 5)
             continue
         printf "%d ", x
     @}
     print ""
@}'
@end example

@noindent
This program prints all the numbers from zero to 20, except for five, for
which the @code{printf} is skipped. Since the increment @samp{x++}
is not skipped, @code{x} does not remain stuck at five.
Contrast the
@code{for} loop above with this @code{while} loop:

@example
awk 'BEGIN @{
     x = 0
     while (x <= 20) @{
         if (x == 5)
             continue
         printf "%d ", x
         x++
     @}
     print ""
@}'
@end example

@noindent
This program loops forever once @code{x} gets to five, because the
@code{continue} skips the @samp{x++} at the end of the loop body.

@cindex @code{continue}, outside of loops
@cindex historical features
@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
@cindex dark corner
As described above, the @code{continue} statement has no meaning when
used outside the body of a loop. However, although it was never documented,
historical implementations of @code{awk} have treated the @code{continue}
statement outside of a loop as if it were a @code{next} statement
(@pxref{Next Statement, ,The @code{next} Statement}).
Recent versions of Unix @code{awk} no longer allow this usage.
@code{gawk} will support this use of @code{continue} only if
@samp{--traditional} has been specified on the command line
(@pxref{Options, ,Command Line Options}).
Otherwise, it will be treated as an error, since the POSIX standard
specifies that @code{continue} should only be used inside the body of a
loop (d.c.).

@node Next Statement, Nextfile Statement, Continue Statement, Statements
@section The @code{next} Statement
@cindex @code{next} statement

The @code{next} statement forces @code{awk} to immediately stop processing
the current record and go on to the next record. This means that no
further rules are executed for the current record. The rest of the
current rule's action is not executed either.

Contrast this with the effect of the @code{getline} function
(@pxref{Getline, ,Explicit Input with @code{getline}}). That too causes
@code{awk} to read the next record immediately, but it does not alter the
flow of control in any way.
So the rest of the current action executes
with a new input record.

At the highest level, @code{awk} program execution is a loop that reads
an input record and then tests each rule's pattern against it. If you
think of this loop as a @code{for} statement whose body contains the
rules, then the @code{next} statement is analogous to a @code{continue}
statement: it skips to the end of the body of this implicit loop, and
executes the increment (which reads another record).

For example, if your @code{awk} program works only on records with four
fields, and you don't want it to fail when given bad input, you might
use this rule near the beginning of the program:

@example
@group
NF != 4 @{
  # no "\n" in the format: print already appends the record separator
  err = sprintf("%s:%d: skipped: NF != 4", FILENAME, FNR)
  print err > "/dev/stderr"
  next
@}
@end group
@end example

@noindent
so that the following rules will not see the bad record. The error
message is redirected to the standard error output stream, as error
messages should be. @xref{Special Files, ,Special File Names in @code{gawk}}.

@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
According to the POSIX standard, the behavior is undefined if
the @code{next} statement is used in a @code{BEGIN} or @code{END} rule.
@code{gawk} will treat it as a syntax error.
Although POSIX permits it,
some other @code{awk} implementations don't allow the @code{next}
statement inside function bodies
(@pxref{User-defined, ,User-defined Functions}).
Just like any other @code{next} statement, a @code{next} inside a
function body reads the next record and starts processing it with the
first rule in the program.

If the @code{next} statement causes the end of the input to be reached,
then the code in any @code{END} rules will be executed.
@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.
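
The behavior described above can be tried with a small sketch. It is not
part of the original manual: the shell function name @code{count_good}
and the sample input are invented for illustration. The first rule uses
@code{next} to discard records that do not have exactly four fields, so
the second rule never sees them. The @code{END} rule still runs, even
though the last record is one of those skipped with @code{next}.

```shell
# count_good: hypothetical helper, invented for this illustration.
count_good() {
    printf 'a b c d\nbad line\n1 2 3 4\nalso bad\n' |
    awk 'NF != 4 { next }        # skip the record; later rules do not run
         { good++ }              # reached only for four-field records
         END { print good " good record(s)" }'
}

count_good
```

Only the two four-field records are counted; the two short records are
dropped before the counting rule is tried.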

@cindex @code{next}, inside a user-defined function
@strong{Caution:} Some @code{awk} implementations generate a run-time
error if you use the @code{next} statement inside a user-defined function
(@pxref{User-defined, , User-defined Functions}).
@code{gawk} does not have this problem.

@node Nextfile Statement, Exit Statement, Next Statement, Statements
@section The @code{nextfile} Statement
@cindex @code{nextfile} statement
@cindex differences between @code{gawk} and @code{awk}

@code{gawk} provides the @code{nextfile} statement,
which is similar to the @code{next} statement.
However, instead of abandoning processing of the current record, the
@code{nextfile} statement instructs @code{gawk} to stop processing the
current data file.

Upon execution of the @code{nextfile} statement, @code{FILENAME} is
updated to the name of the next data file listed on the command line,
@code{FNR} is reset to one, @code{ARGIND} is incremented, and processing
starts over with the first rule in the program. @xref{Built-in Variables}.

If the @code{nextfile} statement causes the end of the input to be reached,
then the code in any @code{END} rules will be executed.
@xref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}.

The @code{nextfile} statement is a @code{gawk} extension; it is not
(currently) available in any other @code{awk} implementation.
@xref{Nextfile Function, ,Implementing @code{nextfile} as a Function},
for a user-defined function you can use to simulate the @code{nextfile}
statement.

The @code{nextfile} statement is useful if you have many data
files to process, and you expect that you
will not want to process every record in every file.
Normally, in order to move on to
the next data file, you would have to continue scanning the unwanted
records.
The @code{nextfile} statement accomplishes this much more
efficiently.

@cindex @code{next file} statement
@strong{Caution:} Versions of @code{gawk} prior to 3.0 used two
words (@samp{next file}) for the @code{nextfile} statement. This was
changed in 3.0 to one word, since the treatment of @samp{file} was
inconsistent. When it appeared after @code{next}, it was a keyword.
Otherwise, it was a regular identifier. The old usage is still
accepted. However, @code{gawk} will generate a warning message, and
support for @code{next file} will eventually be discontinued in a
future version of @code{gawk}.

@node Exit Statement, , Nextfile Statement, Statements
@section The @code{exit} Statement

@cindex @code{exit} statement
The @code{exit} statement causes @code{awk} to immediately stop
executing the current rule and to stop processing input; any remaining input
is ignored. It looks like this:

@example
exit @r{[}@var{return code}@r{]}
@end example

If an @code{exit} statement is executed from a @code{BEGIN} rule the
program stops processing everything immediately. No input records are
read. However, if an @code{END} rule is present, it is executed
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).

If @code{exit} is used as part of an @code{END} rule, it causes
the program to stop immediately.

An @code{exit} statement that is not part
of a @code{BEGIN} or @code{END} rule stops the execution of any further
automatic rules for the current record, skips reading any remaining input
records, and executes
the @code{END} rule if there is one.

If you do not want the @code{END} rule to do its job in this case, you
can set a variable to non-zero before the @code{exit} statement, and check
that variable in the @code{END} rule.
@xref{Assert Function, ,Assertions},
for an example that does this.

@cindex dark corner
If an argument is supplied to @code{exit}, its value is used as the exit
status code for the @code{awk} process. If no argument is supplied,
@code{exit} returns status zero (success). In the case where an argument
is supplied to a first @code{exit} statement, and then @code{exit} is
called a second time with no argument, the previously supplied exit value
is used (d.c.).

For example, let's say you've discovered an error condition you really
don't know how to handle. Conventionally, programs report this by
exiting with a non-zero status. Your @code{awk} program can do this
using an @code{exit} statement with a non-zero argument. Here is an
example:

@example
@group
BEGIN @{
       if (("date" | getline date_now) <= 0) @{
         print "Can't get system date" > "/dev/stderr"
         exit 1
       @}
       print "current date is", date_now
       close("date")
@}
@end group
@end example

@node Built-in Variables, Arrays, Statements, Top
@chapter Built-in Variables
@cindex built-in variables

Most @code{awk} variables are available for you to use for your own
purposes; they never change except when your program assigns values to
them, and never affect anything except when your program examines them.
However, a few variables in @code{awk} have special built-in meanings.
Some of them @code{awk} examines automatically, so that they enable you
to tell @code{awk} how to do certain things. Others are set
automatically by @code{awk}, so that they carry information from the
internal workings of @code{awk} to your program.

This chapter documents all the built-in variables of @code{gawk}. Most
of them are also documented in the chapters describing their areas of
activity.

@menu
* User-modified::               Built-in variables that you change to control
                                @code{awk}.
* Auto-set::                    Built-in variables where @code{awk} gives you
                                information.
* ARGC and ARGV::               Ways to use @code{ARGC} and @code{ARGV}.
@end menu

@node User-modified, Auto-set, Built-in Variables, Built-in Variables
@section Built-in Variables that Control @code{awk}
@cindex built-in variables, user modifiable

This is an alphabetical list of the variables which you can change to
control how @code{awk} does certain things. Those variables that are
specific to @code{gawk} are marked with an asterisk, @samp{*}.

@table @code
@vindex CONVFMT
@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
@item CONVFMT
This string controls conversion of numbers to
strings (@pxref{Conversion, ,Conversion of Strings and Numbers}).
It works by being passed, in effect, as the first argument to the
@code{sprintf} function
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
Its default value is @code{"%.6g"}.
@code{CONVFMT} was introduced by the POSIX standard.

@vindex FIELDWIDTHS
@item FIELDWIDTHS *
This is a space-separated list of field widths that tells @code{gawk}
how to split input with fixed, columnar boundaries. It is an
experimental feature. Assigning to @code{FIELDWIDTHS}
overrides the use of @code{FS} for field splitting.
@xref{Constant Size, ,Reading Fixed-width Data}, for more information.

If @code{gawk} is in compatibility mode
(@pxref{Options, ,Command Line Options}), then @code{FIELDWIDTHS}
has no special meaning, and field splitting operations are done based
exclusively on the value of @code{FS}.

@vindex FS
@item FS
@code{FS} is the input field separator
(@pxref{Field Separators, ,Specifying How Fields are Separated}).
The value is a single-character string or a multi-character regular
expression that matches the separations between fields in an input
record. If the value is the null string (@code{""}), then each
character in the record becomes a separate field.

The default value is @w{@code{" "}}, a string consisting of a single
space. As a special exception, this value means that any
sequence of spaces, tabs, and/or newlines is a single separator.@footnote{In
POSIX @code{awk}, newline does not count as whitespace.} It also causes
spaces, tabs, and newlines at the beginning and end of a record to be ignored.

You can set the value of @code{FS} on the command line using the
@samp{-F} option:

@example
awk -F, '@var{program}' @var{input-files}
@end example

If @code{gawk} is using @code{FIELDWIDTHS} for field-splitting,
assigning a value to @code{FS} will cause @code{gawk} to return to
the normal, @code{FS}-based, field splitting. An easy way to do this
is to simply say @samp{FS = FS}, perhaps with an explanatory comment.

@vindex IGNORECASE
@item IGNORECASE *
If @code{IGNORECASE} is non-zero or non-null, then all string comparisons,
and all regular expression matching are case-independent. Thus, regexp
matching with @samp{~} and @samp{!~}, and the @code{gensub},
@code{gsub}, @code{index}, @code{match}, @code{split} and @code{sub}
functions, record termination with @code{RS}, and field splitting with
@code{FS} all ignore case when doing their particular regexp operations.
The value of @code{IGNORECASE} does @emph{not} affect array subscripting.
@xref{Case-sensitivity, ,Case-sensitivity in Matching}.

If @code{gawk} is in compatibility mode
(@pxref{Options, ,Command Line Options}),
then @code{IGNORECASE} has no special meaning, and string
and regexp operations are always case-sensitive.

@vindex OFMT
@item OFMT
This string controls conversion of numbers to
strings (@pxref{Conversion, ,Conversion of Strings and Numbers}) for
printing with the @code{print} statement. It works by being passed, in
effect, as the first argument to the @code{sprintf} function
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
Its default value is @code{"%.6g"}. Earlier versions of @code{awk}
also used @code{OFMT} to specify the format for converting numbers to
strings in general expressions; this is now done by @code{CONVFMT}.

@vindex OFS
@item OFS
This is the output field separator (@pxref{Output Separators}). It is
output between the fields output by a @code{print} statement. Its
default value is @w{@code{" "}}, a string consisting of a single space.

@vindex ORS
@item ORS
This is the output record separator. It is output at the end of every
@code{print} statement. Its default value is @code{"\n"}.
(@xref{Output Separators}.)

@vindex RS
@item RS
This is @code{awk}'s input record separator. Its default value is a string
containing a single newline character, which means that an input record
consists of a single line of text.
It can also be the null string, in which case records are separated by
runs of blank lines, or a regexp, in which case records are separated by
matches of the regexp in the input text.
(@xref{Records, ,How Input is Split into Records}.)

@vindex SUBSEP
@item SUBSEP
@code{SUBSEP} is the subscript separator. It has the default value of
@code{"\034"}, and is used to separate the parts of the indices of a
multi-dimensional array. Thus, the expression @code{@w{foo["A", "B"]}}
really accesses @code{foo["A\034B"]}
(@pxref{Multi-dimensional, ,Multi-dimensional Arrays}).
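
As a rough illustration of the @code{SUBSEP} description (the shell
function name @code{show_subsep} and the array contents are invented for
this sketch), the following stores one element under a two-part
subscript, splits the combined index back apart on @code{SUBSEP}, and
then finds the same element by building the index by hand:

```shell
# show_subsep: hypothetical helper, invented for this illustration.
show_subsep() {
    awk 'BEGIN {
        foo["A", "B"] = 1
        for (key in foo) {
            n = split(key, part, SUBSEP)   # recover the two subscripts
            print n, part[1], part[2]
        }
        if (("A" SUBSEP "B") in foo)       # same element, index built by hand
            print "found"
    }'
}

show_subsep
```

The loop sees a single index that splits into two parts, which is
consistent with the two subscripts having been joined into one string.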
@end table

@node Auto-set, ARGC and ARGV, User-modified, Built-in Variables
@section Built-in Variables that Convey Information
@cindex built-in variables, convey information

This is an alphabetical list of the variables that are set
automatically by @code{awk} on certain occasions in order to provide
information to your program. Those variables that are specific to
@code{gawk} are marked with an asterisk, @samp{*}.

@table @code
@vindex ARGC
@vindex ARGV
@item ARGC
@itemx ARGV
The command-line arguments available to @code{awk} programs are stored in
an array called @code{ARGV}. @code{ARGC} is the number of command-line
arguments present. @xref{Other Arguments, ,Other Command Line Arguments}.
Unlike most @code{awk} arrays,
@code{ARGV} is indexed from zero to @code{ARGC} @minus{} 1. For example:

@example
@group
$ awk 'BEGIN @{
>        for (i = 0; i < ARGC; i++)
>            print ARGV[i]
>      @}' inventory-shipped BBS-list
@print{} awk
@print{} inventory-shipped
@print{} BBS-list
@end group
@end example

@noindent
In this example, @code{ARGV[0]} contains @code{"awk"}, @code{ARGV[1]}
contains @code{"inventory-shipped"}, and @code{ARGV[2]} contains
@code{"BBS-list"}. The value of @code{ARGC} is three, one more than the
index of the last element in @code{ARGV}, since the elements are numbered
from zero.

The names @code{ARGC} and @code{ARGV}, as well as the convention of indexing
the array from zero to @code{ARGC} @minus{} 1, are derived from the C language's
method of accessing command line arguments.
@xref{ARGC and ARGV, , Using @code{ARGC} and @code{ARGV}}, for information
about how @code{awk} uses these variables.

@vindex ARGIND
@item ARGIND *
The index in @code{ARGV} of the current file being processed.
Every time @code{gawk} opens a new data file for processing, it sets
@code{ARGIND} to the index in @code{ARGV} of the file name.
When @code{gawk} is processing the input files, it is always
true that @samp{FILENAME == ARGV[ARGIND]}.

This variable is useful in file processing; it allows you to tell how far
along you are in the list of data files, and to distinguish between
successive instances of the same filename on the command line.

While you can change the value of @code{ARGIND} within your @code{awk}
program, @code{gawk} will automatically set it to a new value when the
next file is opened.

This variable is a @code{gawk} extension. In other @code{awk} implementations,
or if @code{gawk} is in compatibility mode
(@pxref{Options, ,Command Line Options}),
it is not special.

@vindex ENVIRON
@item ENVIRON
An associative array that contains the values of the environment. The array
indices are the environment variable names; the values are the values of
the particular environment variables. For example,
@code{ENVIRON["HOME"]} might be @file{/home/arnold}. Changing this array
does not affect the environment passed on to any programs that
@code{awk} may spawn via redirection or the @code{system} function.
(In a future version of @code{gawk}, it may do so.)

Some operating systems may not have environment variables.
On such systems, the @code{ENVIRON} array is empty (except for
@w{@code{ENVIRON["AWKPATH"]}}).

@vindex ERRNO
@item ERRNO *
If a system error occurs either doing a redirection for @code{getline},
during a read for @code{getline}, or during a @code{close} operation,
then @code{ERRNO} will contain a string describing the error.

This variable is a @code{gawk} extension.
In other @code{awk} implementations,
or if @code{gawk} is in compatibility mode
(@pxref{Options, ,Command Line Options}),
it is not special.

@cindex dark corner
@vindex FILENAME
@item FILENAME
This is the name of the file that @code{awk} is currently reading.
When no data files are listed on the command line, @code{awk} reads
from the standard input, and @code{FILENAME} is set to @code{"-"}.
@code{FILENAME} is changed each time a new file is read
(@pxref{Reading Files, ,Reading Input Files}).
Inside a @code{BEGIN} rule, the value of @code{FILENAME} is
@code{""}, since there are no input files being processed
yet.@footnote{Some early implementations of Unix @code{awk} initialized
@code{FILENAME} to @code{"-"}, even if there were data files to be
processed. This behavior was incorrect, and should not be relied
upon in your programs.} (d.c.)

@vindex FNR
@item FNR
@code{FNR} is the current record number in the current file. @code{FNR} is
incremented each time a new record is read
(@pxref{Getline, ,Explicit Input with @code{getline}}). It is reinitialized
to zero each time a new input file is started.

@vindex NF
@item NF
@code{NF} is the number of fields in the current input record.
@code{NF} is set each time a new record is read, when a new field is
created, or when @code{$0} changes (@pxref{Fields, ,Examining Fields}).

@vindex NR
@item NR
This is the number of input records @code{awk} has processed since
the beginning of the program's execution
(@pxref{Records, ,How Input is Split into Records}).
@code{NR} is set each time a new record is read.

@vindex RLENGTH
@item RLENGTH
@code{RLENGTH} is the length of the substring matched by the
@code{match} function
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
@code{RLENGTH} is set by invoking the @code{match} function.
Its value
is the length of the matched string, or @minus{}1 if no match was found.

@vindex RSTART
@item RSTART
@code{RSTART} is the start-index in characters of the substring matched by the
@code{match} function
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
@code{RSTART} is set by invoking the @code{match} function. Its value
is the position in the string where the matched substring starts, or zero
if no match was found.

@vindex RT
@item RT *
@code{RT} is set each time a record is read. It contains the input text
that matched the text denoted by @code{RS}, the record separator.

This variable is a @code{gawk} extension. In other @code{awk} implementations,
or if @code{gawk} is in compatibility mode
(@pxref{Options, ,Command Line Options}),
it is not special.
@end table

@cindex dark corner
A side note about @code{NR} and @code{FNR}.
@code{awk} simply increments both of these variables
each time it reads a record, instead of setting them to the absolute
value of the number of records read. This means that your program can
change these variables, and their new values will be incremented for
each record (d.c.). For example:

@example
@group
$ echo '1
> 2
> 3
> 4' | awk 'NR == 2 @{ NR = 17 @}
> @{ print NR @}'
@print{} 1
@print{} 17
@print{} 18
@print{} 19
@end group
@end example

@noindent
Before @code{FNR} was added to the @code{awk} language
(@pxref{V7/SVR3.1, ,Major Changes between V7 and SVR3.1}),
many @code{awk} programs used this feature to track the number of
records in a file by resetting @code{NR} to zero when @code{FILENAME}
changed.
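
The old idiom mentioned above might look something like the following
sketch. It is a reconstruction for illustration only, not code from any
historical program: the shell function name, the variable @code{last},
and the temporary files are all invented, and @code{mktemp -d} is
assumed to be available. Because the rule runs after the first record of
a file has already been read and counted, this version sets @code{NR} to
one rather than zero.

```shell
# per_file_numbers: hypothetical helper, invented for this illustration.
per_file_numbers() {
    nr_demo_dir=$(mktemp -d) || return 1
    printf 'a\nb\n'    > "$nr_demo_dir/one"
    printf 'c\nd\ne\n' > "$nr_demo_dir/two"
    # Restart the count whenever FILENAME differs from the last one seen.
    awk 'FILENAME != last { NR = 1; last = FILENAME }
         { print NR ": " $0 }' "$nr_demo_dir/one" "$nr_demo_dir/two"
    rm -rf "$nr_demo_dir"
}

per_file_numbers
```

In new programs @code{FNR} does the same job directly, without
disturbing the running total in @code{NR}.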

@node ARGC and ARGV, , Auto-set, Built-in Variables
@section Using @code{ARGC} and @code{ARGV}

In @ref{Auto-set, , Built-in Variables that Convey Information},
you saw this program describing the information contained in @code{ARGC}
and @code{ARGV}:

@example
@group
$ awk 'BEGIN @{
>        for (i = 0; i < ARGC; i++)
>            print ARGV[i]
>      @}' inventory-shipped BBS-list
@print{} awk
@print{} inventory-shipped
@print{} BBS-list
@end group
@end example

@noindent
In this example, @code{ARGV[0]} contains @code{"awk"}, @code{ARGV[1]}
contains @code{"inventory-shipped"}, and @code{ARGV[2]} contains
@code{"BBS-list"}.

Notice that the @code{awk} program is not entered in @code{ARGV}. The
other special command line options, with their arguments, are also not
entered. This includes variable assignments done with the @samp{-v}
option (@pxref{Options, ,Command Line Options}).
Normal variable assignments on the command line @emph{are}
treated as arguments, and do show up in the @code{ARGV} array.

@example
$ cat showargs.awk
@print{} BEGIN @{
@print{}     printf "A=%d, B=%d\n", A, B
@print{}     for (i = 0; i < ARGC; i++)
@print{}         printf "\tARGV[%d] = %s\n", i, ARGV[i]
@print{} @}
@print{} END   @{ printf "A=%d, B=%d\n", A, B @}
$ awk -v A=1 -f showargs.awk B=2 /dev/null
@print{} A=1, B=0
@print{}         ARGV[0] = awk
@print{}         ARGV[1] = B=2
@print{}         ARGV[2] = /dev/null
@print{} A=1, B=2
@end example

Your program can alter @code{ARGC} and the elements of @code{ARGV}.
Each time @code{awk} reaches the end of an input file, it uses the next
element of @code{ARGV} as the name of the next input file. By storing a
different string there, your program can change which files are read.
You can use @code{"-"} to represent the standard input.
By storing
additional elements and incrementing @code{ARGC} you can cause
additional files to be read.

If you decrease the value of @code{ARGC}, that eliminates input files
from the end of the list. By recording the old value of @code{ARGC}
elsewhere, your program can treat the eliminated arguments as
something other than file names.

To eliminate a file from the middle of the list, store the null string
(@code{""}) into @code{ARGV} in place of the file's name. As a
special feature, @code{awk} ignores file names that have been
replaced with the null string.
You may also use the @code{delete} statement to remove elements from
@code{ARGV} (@pxref{Delete, ,The @code{delete} Statement}).

All of these actions are typically done from the @code{BEGIN} rule,
before actual processing of the input begins.
@xref{Split Program, ,Splitting a Large File Into Pieces}, and see
@ref{Tee Program, ,Duplicating Output Into Multiple Files}, for an example
of each way of removing elements from @code{ARGV}.

The following fragment processes @code{ARGV} in order to examine, and
then remove, command line options.

@example
@group
BEGIN @{
    for (i = 1; i < ARGC; i++) @{
        if (ARGV[i] == "-v")
            verbose = 1
        else if (ARGV[i] == "-d")
            debug = 1
@end group
@group
        else if (ARGV[i] ~ /^-./) @{
            # any other argument starting with a dash: report its letter
            e = sprintf("%s: unrecognized option -- %c",
                    ARGV[0], substr(ARGV[i], 2, 1))
            print e > "/dev/stderr"
        @} else
            break
        delete ARGV[i]
    @}
@}
@end group
@end example

To actually get the options into the @code{awk} program, you have to
end the @code{awk} options with @samp{--}, and then supply your options,
like so:

@example
awk -f myprog -- -v -d file1 file2 @dots{}
@end example

@cindex differences between @code{gawk} and @code{awk}
This is not necessary in @code{gawk}: Unless @samp{--posix} has been
specified, @code{gawk} silently puts any unrecognized options into
@code{ARGV} for the @code{awk} program to deal with.

As soon as it
sees an unknown option, @code{gawk} stops looking for other options it might
otherwise recognize. The above example with @code{gawk} would be:

@example
gawk -f myprog -d -v file1 file2 @dots{}
@end example

@noindent
Since @samp{-d} is not a valid @code{gawk} option, the following @samp{-v}
is passed on to the @code{awk} program.

@node Arrays, Built-in, Built-in Variables, Top
@chapter Arrays in @code{awk}

An @dfn{array} is a table of values, called @dfn{elements}. The
elements of an array are distinguished by their indices. @dfn{Indices}
may be either numbers or strings. @code{awk} maintains a single set
of names that may be used for naming variables, arrays and functions
(@pxref{User-defined, ,User-defined Functions}).
Thus, you cannot have a variable and an array with the same name in the
same @code{awk} program.

@menu
* Array Intro::                 Introduction to Arrays
* Reference to Elements::       How to examine one element of an array.
* Assigning Elements::          How to change an element of an array.
* Array Example::               Basic Example of an Array
* Scanning an Array::           A variation of the @code{for} statement. It
                                loops through the indices of an array's
                                existing elements.
* Delete::                      The @code{delete} statement removes an element
                                from an array.
* Numeric Array Subscripts::    How to use numbers as subscripts in
                                @code{awk}.
* Uninitialized Subscripts::    Using Uninitialized variables as subscripts.
* Multi-dimensional::           Emulating multi-dimensional arrays in
                                @code{awk}.
* Multi-scanning::              Scanning multi-dimensional arrays.
* Array Efficiency::            Implementation-specific tips.
@end menu

@node Array Intro, Reference to Elements, Arrays, Arrays
@section Introduction to Arrays

@cindex arrays
The @code{awk} language provides one-dimensional @dfn{arrays} for storing groups
of related strings or numbers.

Every @code{awk} array must have a name. Array names have the same
syntax as variable names; any valid variable name would also be a valid
array name. But you cannot use one name in both ways (as an array and
as a variable) in one @code{awk} program.

Arrays in @code{awk} superficially resemble arrays in other programming
languages; but there are fundamental differences. In @code{awk}, you
don't need to specify the size of an array before you start to use it.
Additionally, any number or string in @code{awk} may be used as an
array index, not just consecutive integers.

In most other languages, you have to @dfn{declare} an array and specify
how many elements or components it contains. In such languages, the
declaration causes a contiguous block of memory to be allocated for that
many elements.
An index in the array usually must be a non-negative integer; for
example, the index zero specifies the first element in the array, which is
actually stored at the beginning of the block of memory. Index one
specifies the second element, which is stored in memory right after the
first element, and so on. It is impossible to add more elements to the
array, because it has room for only as many elements as you declared.
(Some languages allow arbitrary starting and ending indices,
e.g., @samp{15 .. 27}, but the size of the array is still fixed when
the array is declared.)

A contiguous array of four elements might look like this,
conceptually, if the element values are eight, @code{"foo"},
@code{""} and 30:

@iftex
@c from Karl Berry, much thanks for the help.
@tex
\bigskip % space above the table (about 1 linespace)
\offinterlineskip
\newdimen\width \width = 1.5cm
\newdimen\hwidth \hwidth = 4\width \advance\hwidth by 2pt % 5 * 0.4pt
\centerline{\vbox{
\halign{\strut\hfil\ignorespaces#&&\vrule#&\hbox to\width{\hfil#\unskip\hfil}\cr
\noalign{\hrule width\hwidth}
        &&{\tt 8} &&{\tt "foo"} &&{\tt ""} &&{\tt 30} &&\quad value\cr
\noalign{\hrule width\hwidth}
\noalign{\smallskip}
        &\omit&0&\omit &1 &\omit&2 &\omit&3 &\omit&\quad index\cr
}
}}
@end tex
@end iftex
@ifinfo
@example
+---------+---------+--------+---------+
|    8    |  "foo"  |   ""   |    30   |    @r{value}
+---------+---------+--------+---------+
     0         1         2         3        @r{index}
@end example
@end ifinfo

@noindent
Only the values are stored; the indices are implicit from the order of
the values. Eight is the value at index zero, because eight appears in the
position with zero elements before it.

@cindex arrays, definition of
@cindex associative arrays
@cindex arrays, associative
Arrays in @code{awk} are different: they are @dfn{associative}.
This means 8881that each array is a collection of pairs: an index, and its corresponding 8882array element value: 8883 8884@example 8885@r{Element} 4 @r{Value} 30 8886@r{Element} 2 @r{Value} "foo" 8887@r{Element} 1 @r{Value} 8 8888@r{Element} 3 @r{Value} "" 8889@end example 8890 8891@noindent 8892We have shown the pairs in jumbled order because their order is irrelevant. 8893 8894One advantage of associative arrays is that new pairs can be added 8895at any time. For example, suppose we add to the above array a tenth element 8896whose value is @w{@code{"number ten"}}. The result is this: 8897 8898@example 8899@r{Element} 10 @r{Value} "number ten" 8900@r{Element} 4 @r{Value} 30 8901@r{Element} 2 @r{Value} "foo" 8902@r{Element} 1 @r{Value} 8 8903@r{Element} 3 @r{Value} "" 8904@end example 8905 8906@noindent 8907@cindex sparse arrays 8908@cindex arrays, sparse 8909Now the array is @dfn{sparse}, which just means some indices are missing: 8910it has elements 1--4 and 10, but doesn't have elements 5, 6, 7, 8, or 9. 8911@c ok, I should spell out the above, but ... 8912 8913Another consequence of associative arrays is that the indices don't 8914have to be positive integers. Any number, or even a string, can be 8915an index. For example, here is an array which translates words from 8916English into French: 8917 8918@example 8919@r{Element} "dog" @r{Value} "chien" 8920@r{Element} "cat" @r{Value} "chat" 8921@r{Element} "one" @r{Value} "un" 8922@r{Element} 1 @r{Value} "un" 8923@end example 8924 8925@noindent 8926Here we decided to translate the number one in both spelled-out and 8927numeric form---thus illustrating that a single array can have both 8928numbers and strings as indices. 8929(In fact, array subscripts are always strings; this is discussed 8930in more detail in 8931@ref{Numeric Array Subscripts, ,Using Numbers to Subscript Arrays}.) 
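
The following short sketch is not part of the original example; it simply
builds the translation array shown above and looks up one string index and
one numeric index. The array name @code{french} is an assumption made
for illustration.

@example
@group
awk 'BEGIN @{
    french["dog"] = "chien"
    french["cat"] = "chat"
    french["one"] = "un"
    french[1] = "un"
    # string and numeric indices coexist in the same array
    print french["dog"], french[1]
@}'
@end group
@end example

@noindent
When run, this prints @samp{chien un}.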
8932 8933@cindex Array subscripts and @code{IGNORECASE} 8934@cindex @code{IGNORECASE} and array subscripts 8935@vindex IGNORECASE 8936The value of @code{IGNORECASE} has no effect upon array subscripting. 8937You must use the exact same string value to retrieve an array element 8938as you used to store it. 8939 8940When @code{awk} creates an array for you, e.g., with the @code{split} 8941built-in function, 8942that array's indices are consecutive integers starting at one. 8943(@xref{String Functions, ,Built-in Functions for String Manipulation}.) 8944 8945@node Reference to Elements, Assigning Elements, Array Intro, Arrays 8946@section Referring to an Array Element 8947@cindex array reference 8948@cindex element of array 8949@cindex reference to array 8950 8951The principal way of using an array is to refer to one of its elements. 8952An array reference is an expression which looks like this: 8953 8954@example 8955@var{array}[@var{index}] 8956@end example 8957 8958@noindent 8959Here, @var{array} is the name of an array. The expression @var{index} is 8960the index of the element of the array that you want. 8961 8962The value of the array reference is the current value of that array 8963element. For example, @code{foo[4.3]} is an expression for the element 8964of array @code{foo} at index @samp{4.3}. 8965 8966If you refer to an array element that has no recorded value, the value 8967of the reference is @code{""}, the null string. This includes elements 8968to which you have not assigned any value, and elements that have been 8969deleted (@pxref{Delete, ,The @code{delete} Statement}). Such a reference 8970automatically creates that array element, with the null string as its value. 8971(In some cases, this is unfortunate, because it might waste memory inside 8972@code{awk}.) 
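
As a small illustration (a sketch only; it uses the @code{in} operator
that is described next, and the array name @code{seen} is hypothetical),
merely reading an element is enough to create it:

@example
@group
awk 'BEGIN @{
    if ("x" in seen)
        print "before: present"
    else
        print "before: absent"
    tmp = seen["x"]     # a reference alone creates seen["x"]
    if ("x" in seen)
        print "after: present"
    else
        print "after: absent"
@}'
@end group
@end example

@noindent
This prints @samp{before: absent} followed by @samp{after: present}.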
8973 8974@cindex arrays, presence of elements 8975@cindex arrays, the @code{in} operator 8976You can find out if an element exists in an array at a certain index with 8977the expression: 8978 8979@example 8980@var{index} in @var{array} 8981@end example 8982 8983@noindent 8984This expression tests whether or not the particular index exists, 8985without the side effect of creating that element if it is not present. 8986The expression has the value one (true) if @code{@var{array}[@var{index}]} 8987exists, and zero (false) if it does not exist. 8988 8989For example, to test whether the array @code{frequencies} contains the 8990index @samp{2}, you could write this statement: 8991 8992@example 8993if (2 in frequencies) 8994 print "Subscript 2 is present." 8995@end example 8996 8997Note that this is @emph{not} a test of whether or not the array 8998@code{frequencies} contains an element whose @emph{value} is two. 8999(There is no way to do that except to scan all the elements.) Also, this 9000@emph{does not} create @code{frequencies[2]}, while the following 9001(incorrect) alternative would do so: 9002 9003@example 9004if (frequencies[2] != "") 9005 print "Subscript 2 is present." 9006@end example 9007 9008@node Assigning Elements, Array Example, Reference to Elements, Arrays 9009@section Assigning Array Elements 9010@cindex array assignment 9011@cindex element assignment 9012 9013Array elements are lvalues: they can be assigned values just like 9014@code{awk} variables: 9015 9016@example 9017@var{array}[@var{subscript}] = @var{value} 9018@end example 9019 9020@noindent 9021Here @var{array} is the name of your array. The expression 9022@var{subscript} is the index of the element of the array that you want 9023to assign a value. The expression @var{value} is the value you are 9024assigning to that element of the array. 
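
For instance, here is a minimal sketch (the array name @code{count} and
its contents are assumptions made for illustration) that assigns a value
to an element and then assigns it a new value:

@example
@group
awk 'BEGIN @{
    count["apple"] = 3                     # create the element
    count["apple"] = count["apple"] + 1    # assign it a new value
    print count["apple"]
@}'
@end group
@end example

@noindent
This prints @samp{4}.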
9025 9026@node Array Example, Scanning an Array, Assigning Elements, Arrays 9027@section Basic Array Example 9028 9029The following program takes a list of lines, each beginning with a line 9030number, and prints them out in order of line number. The line numbers are 9031not in order, however, when they are first read: they are scrambled. This 9032program sorts the lines by making an array using the line numbers as 9033subscripts. It then prints out the lines in sorted order of their numbers. 9034It is a very simple program, and gets confused if it encounters repeated 9035numbers, gaps, or lines that don't begin with a number. 9036 9037@example 9038@group 9039@c file eg/misc/arraymax.awk 9040@{ 9041 if ($1 > max) 9042 max = $1 9043 arr[$1] = $0 9044@} 9045@end group 9046 9047END @{ 9048 for (x = 1; x <= max; x++) 9049 print arr[x] 9050@} 9051@c endfile 9052@end example 9053 9054The first rule keeps track of the largest line number seen so far; 9055it also stores each line into the array @code{arr}, at an index that 9056is the line's number. 9057 9058The second rule runs after all the input has been read, to print out 9059all the lines. 9060 9061When this program is run with the following input: 9062 9063@example 9064@group 9065@c file eg/misc/arraymax.data 90665 I am the Five man 90672 Who are you? The new number two! 90684 . . . And four on the floor 90691 Who is number one? 90703 I three you. 9071@c endfile 9072@end group 9073@end example 9074 9075@noindent 9076its output is this: 9077 9078@example 90791 Who is number one? 90802 Who are you? The new number two! 90813 I three you. 90824 . . . And four on the floor 90835 I am the Five man 9084@end example 9085 9086If a line number is repeated, the last line with a given number overrides 9087the others. 
9088 9089Gaps in the line numbers can be handled with an easy improvement to the 9090program's @code{END} rule: 9091 9092@example 9093END @{ 9094 for (x = 1; x <= max; x++) 9095 if (x in arr) 9096 print arr[x] 9097@} 9098@end example 9099 9100@node Scanning an Array, Delete, Array Example, Arrays 9101@section Scanning All Elements of an Array 9102@cindex @code{for (x in @dots{})} 9103@cindex arrays, special @code{for} statement 9104@cindex scanning an array 9105 9106In programs that use arrays, you often need a loop that executes 9107once for each element of an array. In other languages, where arrays are 9108contiguous and indices are limited to positive integers, this is 9109easy: you can 9110find all the valid indices by counting from the lowest index 9111up to the highest. This 9112technique won't do the job in @code{awk}, since any number or string 9113can be an array index. So @code{awk} has a special kind of @code{for} 9114statement for scanning an array: 9115 9116@example 9117for (@var{var} in @var{array}) 9118 @var{body} 9119@end example 9120 9121@noindent 9122This loop executes @var{body} once for each index in @var{array} that your 9123program has previously used, with the 9124variable @var{var} set to that index. 9125 9126Here is a program that uses this form of the @code{for} statement. The 9127first rule scans the input records and notes which words appear (at 9128least once) in the input, by storing a one into the array @code{used} with 9129the word as index. The second rule scans the elements of @code{used} to 9130find all the distinct words that appear in the input. It prints each 9131word that is more than 10 characters long, and also prints the number of 9132such words. @xref{String Functions, ,Built-in Functions for String Manipulation}, for more information 9133on the built-in function @code{length}. 9134 9135@example 9136# Record a 1 for each word that is used at least once. 
9137@{ 9138 for (i = 1; i <= NF; i++) 9139 used[$i] = 1 9140@} 9141 9142# Find number of distinct words more than 10 characters long. 9143END @{ 9144 for (x in used) 9145 if (length(x) > 10) @{ 9146 ++num_long_words 9147 print x 9148 @} 9149 print num_long_words, "words longer than 10 characters" 9150@} 9151@end example 9152 9153@noindent 9154@xref{Word Sorting, ,Generating Word Usage Counts}, 9155for a more detailed example of this type. 9156 9157The order in which elements of the array are accessed by this statement 9158is determined by the internal arrangement of the array elements within 9159@code{awk} and cannot be controlled or changed. This can lead to 9160problems if new elements are added to @var{array} by statements in 9161the loop body; you cannot predict whether or not the @code{for} loop will 9162reach them. Similarly, changing @var{var} inside the loop may produce 9163strange results. It is best to avoid such things. 9164 9165@node Delete, Numeric Array Subscripts, Scanning an Array, Arrays 9166@section The @code{delete} Statement 9167@cindex @code{delete} statement 9168@cindex deleting elements of arrays 9169@cindex removing elements of arrays 9170@cindex arrays, deleting an element 9171 9172You can remove an individual element of an array using the @code{delete} 9173statement: 9174 9175@example 9176delete @var{array}[@var{index}] 9177@end example 9178 9179Once you have deleted an array element, you can no longer obtain any 9180value the element once had. It is as if you had never referred 9181to it and had never given it any value. 9182 9183Here is an example of deleting elements in an array: 9184 9185@example 9186for (i in frequencies) 9187 delete frequencies[i] 9188@end example 9189 9190@noindent 9191This example removes all the elements from the array @code{frequencies}. 
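
A self-contained version of this loop might look as follows. It is only
a sketch: the two sample elements are made up, and the counter @code{n}
is introduced just to show that nothing is left afterwards.

@example
@group
awk 'BEGIN @{
    frequencies["a"] = 1
    frequencies["b"] = 2
    for (i in frequencies)
        delete frequencies[i]
    n = 0
    for (i in frequencies)    # count whatever remains
        n++
    print n
@}'
@end group
@end example

@noindent
This prints @samp{0}.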
9192 9193If you delete an element, a subsequent @code{for} statement to scan the array 9194will not report that element, and the @code{in} operator to check for 9195the presence of that element will return zero (i.e.@: false): 9196 9197@example 9198delete foo[4] 9199if (4 in foo) 9200 print "This will never be printed" 9201@end example 9202 9203It is important to note that deleting an element is @emph{not} the 9204same as assigning it a null value (the empty string, @code{""}). 9205 9206@example 9207foo[4] = "" 9208if (4 in foo) 9209 print "This is printed, even though foo[4] is empty" 9210@end example 9211 9212It is not an error to delete an element that does not exist. 9213 9214@cindex arrays, deleting entire contents 9215@cindex deleting entire arrays 9216@cindex differences between @code{gawk} and @code{awk} 9217You can delete all the elements of an array with a single statement, 9218by leaving off the subscript in the @code{delete} statement. 9219 9220@example 9221delete @var{array} 9222@end example 9223 9224This ability is a @code{gawk} extension; it is not available in 9225compatibility mode (@pxref{Options, ,Command Line Options}). 9226 9227Using this version of the @code{delete} statement is about three times 9228more efficient than the equivalent loop that deletes each element one 9229at a time. 9230 9231@cindex portability issues 9232The following statement provides a portable, but non-obvious way to clear 9233out an array. 9234 9235@cindex Brennan, Michael 9236@example 9237@group 9238# thanks to Michael Brennan for pointing this out 9239split("", array) 9240@end group 9241@end example 9242 9243The @code{split} function 9244(@pxref{String Functions, ,Built-in Functions for String Manipulation}) 9245clears out the target array first. This call asks it to split 9246apart the null string. Since there is no data to split out, the 9247function simply clears the array and then returns. 
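
The effect can be checked with a short sketch such as the following.
The array name @code{parts} and the counter @code{left} are
illustrative assumptions, not names used elsewhere in this @value{DOCUMENT}.

@example
@group
awk 'BEGIN @{
    n = split("a b c", parts)    # parts now has three elements
    print n
    n = split("", parts)         # nothing to split: parts is cleared
    print n
    left = 0
    for (k in parts)
        left++
    print left
@}'
@end group
@end example

@noindent
This prints @samp{3}, then @samp{0}, then @samp{0}.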

@strong{Caution:} Deleting an array does not change its type; you cannot
delete an array and then use the array's name as a scalar. For
example, this will not work:

@example
a[1] = 3; delete a; a = 3
@end example

@node Numeric Array Subscripts, Uninitialized Subscripts, Delete, Arrays
@section Using Numbers to Subscript Arrays

An important aspect of arrays to remember is that @emph{array subscripts
are always strings}. If you use a numeric value as a subscript,
it will be converted to a string value before it is used for subscripting
(@pxref{Conversion, ,Conversion of Strings and Numbers}).

@cindex conversions, during subscripting
@cindex numbers, used as subscripts
@vindex CONVFMT
This means that the value of the built-in variable @code{CONVFMT} can potentially
affect how your program accesses elements of an array. For example:

@example
xyz = 12.153
data[xyz] = 1
CONVFMT = "%2.2f"
@group
if (xyz in data)
    printf "%s is in data\n", xyz
else
    printf "%s is not in data\n", xyz
@end group
@end example

@noindent
This prints @samp{12.15 is not in data}. The first statement gives
@code{xyz} a numeric value. Assigning to
@code{data[xyz]} subscripts @code{data} with the string value @code{"12.153"}
(using the default conversion value of @code{CONVFMT}, @code{"%.6g"}),
and assigns one to @code{data["12.153"]}. The program then changes
the value of @code{CONVFMT}. The test @samp{(xyz in data)} generates a new
string value from @code{xyz}, this time @code{"12.15"}, since the new value of
@code{CONVFMT} allows only two digits after the decimal point. This test fails,
since @code{"12.15"} is a different string from @code{"12.153"}.
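
One possible way to sidestep this problem (a sketch, not something the
text above prescribes) is to convert the number to a string once, and
to use that string as the subscript from then on. The variable name
@code{key} is an assumption made for illustration.

@example
@group
awk 'BEGIN @{
    xyz = 12.153
    key = xyz ""          # converted once, with the default CONVFMT
    data[key] = 1
    CONVFMT = "%2.2f"     # no longer affects key, which is a string
    if (key in data)
        print "found"
    else
        print "missing"
@}'
@end group
@end example

@noindent
This prints @samp{found}, because @code{key} already holds the string
@code{"12.153"} and is not converted again.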
9293 9294According to the rules for conversions 9295(@pxref{Conversion, ,Conversion of Strings and Numbers}), integer 9296values are always converted to strings as integers, no matter what the 9297value of @code{CONVFMT} may happen to be. So the usual case of: 9298 9299@example 9300for (i = 1; i <= maxsub; i++) 9301 @i{do something with} array[i] 9302@end example 9303 9304@noindent 9305will work, no matter what the value of @code{CONVFMT}. 9306 9307Like many things in @code{awk}, the majority of the time things work 9308as you would expect them to work. But it is useful to have a precise 9309knowledge of the actual rules, since sometimes they can have a subtle 9310effect on your programs. 9311 9312@node Uninitialized Subscripts, Multi-dimensional, Numeric Array Subscripts, Arrays 9313@section Using Uninitialized Variables as Subscripts 9314 9315@cindex uninitialized variables, as array subscripts 9316@cindex array subscripts, uninitialized variables 9317Suppose you want to print your input data in reverse order. 9318A reasonable attempt at a program to do so (with some test 9319data) might look like this: 9320 9321@example 9322@group 9323$ echo 'line 1 9324> line 2 9325> line 3' | awk '@{ l[lines] = $0; ++lines @} 9326> END @{ 9327> for (i = lines-1; i >= 0; --i) 9328> print l[i] 9329> @}' 9330@print{} line 3 9331@print{} line 2 9332@end group 9333@end example 9334 9335Unfortunately, the very first line of input data did not come out in the 9336output! 9337 9338At first glance, this program should have worked. The variable @code{lines} 9339is uninitialized, and uninitialized variables have the numeric value zero. 9340So, @code{awk} should have printed the value of @code{l[0]}. 9341 9342The issue here is that subscripts for @code{awk} arrays are @strong{always} 9343strings. And uninitialized variables, when used as strings, have the 9344value @code{""}, not zero. Thus, @samp{line 1} ended up stored in 9345@code{l[""]}. 
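
This can be confirmed directly with a short sketch like the one below;
the messages are made up for illustration.

@example
@group
awk 'BEGIN @{
    l[lines] = "line 1"       # lines has never been assigned
    if ("" in l)
        print "found under the null string"
    if (0 in l)
        print "found under 0"
@}'
@end group
@end example

@noindent
Only @samp{found under the null string} is printed.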
9346 9347The following version of the program works correctly: 9348 9349@example 9350@{ l[lines++] = $0 @} 9351END @{ 9352 for (i = lines - 1; i >= 0; --i) 9353 print l[i] 9354@} 9355@end example 9356 9357Here, the @samp{++} forces @code{lines} to be numeric, thus making 9358the ``old value'' numeric zero, which is then converted to @code{"0"} 9359as the array subscript. 9360 9361@cindex null string, as array subscript 9362@cindex dark corner 9363As we have just seen, even though it is somewhat unusual, the null string 9364(@code{""}) is a valid array subscript (d.c.). If @samp{--lint} is provided 9365on the command line (@pxref{Options, ,Command Line Options}), 9366@code{gawk} will warn about the use of the null string as a subscript. 9367 9368@node Multi-dimensional, Multi-scanning, Uninitialized Subscripts, Arrays 9369@section Multi-dimensional Arrays 9370 9371@cindex subscripts in arrays 9372@cindex arrays, multi-dimensional subscripts 9373@cindex multi-dimensional subscripts 9374A multi-dimensional array is an array in which an element is identified 9375by a sequence of indices, instead of a single index. For example, a 9376two-dimensional array requires two indices. The usual way (in most 9377languages, including @code{awk}) to refer to an element of a 9378two-dimensional array named @code{grid} is with 9379@code{grid[@var{x},@var{y}]}. 9380 9381@vindex SUBSEP 9382Multi-dimensional arrays are supported in @code{awk} through 9383concatenation of indices into one string. What happens is that 9384@code{awk} converts the indices into strings 9385(@pxref{Conversion, ,Conversion of Strings and Numbers}) and 9386concatenates them together, with a separator between them. This creates 9387a single string that describes the values of the separate indices. The 9388combined string is used as a single index into an ordinary, 9389one-dimensional array. The separator used is the value of the built-in 9390variable @code{SUBSEP}. 
9391 9392For example, suppose we evaluate the expression @samp{foo[5,12] = "value"} 9393when the value of @code{SUBSEP} is @code{"@@"}. The numbers five and 12 are 9394converted to strings and 9395concatenated with an @samp{@@} between them, yielding @code{"5@@12"}; thus, 9396the array element @code{foo["5@@12"]} is set to @code{"value"}. 9397 9398Once the element's value is stored, @code{awk} has no record of whether 9399it was stored with a single index or a sequence of indices. The two 9400expressions @samp{foo[5,12]} and @w{@samp{foo[5 SUBSEP 12]}} are always 9401equivalent. 9402 9403The default value of @code{SUBSEP} is the string @code{"\034"}, 9404which contains a non-printing character that is unlikely to appear in an 9405@code{awk} program or in most input data. 9406 9407The usefulness of choosing an unlikely character comes from the fact 9408that index values that contain a string matching @code{SUBSEP} lead to 9409combined strings that are ambiguous. Suppose that @code{SUBSEP} were 9410@code{"@@"}; then @w{@samp{foo["a@@b", "c"]}} and @w{@samp{foo["a", 9411"b@@c"]}} would be indistinguishable because both would actually be 9412stored as @samp{foo["a@@b@@c"]}. 9413 9414You can test whether a particular index-sequence exists in a 9415``multi-dimensional'' array with the same operator @samp{in} used for single 9416dimensional arrays. Instead of a single index as the left-hand operand, 9417write the whole sequence of indices, separated by commas, in 9418parentheses: 9419 9420@example 9421(@var{subscript1}, @var{subscript2}, @dots{}) in @var{array} 9422@end example 9423 9424The following example treats its input as a two-dimensional array of 9425fields; it rotates this array 90 degrees clockwise and prints the 9426result. It assumes that all lines have the same number of 9427elements. 
9428 9429@example 9430@group 9431awk '@{ 9432 if (max_nf < NF) 9433 max_nf = NF 9434 max_nr = NR 9435 for (x = 1; x <= NF; x++) 9436 vector[x, NR] = $x 9437@} 9438@end group 9439 9440@group 9441END @{ 9442 for (x = 1; x <= max_nf; x++) @{ 9443 for (y = max_nr; y >= 1; --y) 9444 printf("%s ", vector[x, y]) 9445 printf("\n") 9446 @} 9447@}' 9448@end group 9449@end example 9450 9451@noindent 9452When given the input: 9453 9454@example 9455@group 94561 2 3 4 5 6 94572 3 4 5 6 1 94583 4 5 6 1 2 94594 5 6 1 2 3 9460@end group 9461@end example 9462 9463@noindent 9464it produces: 9465 9466@example 9467@group 94684 3 2 1 94695 4 3 2 94706 5 4 3 94711 6 5 4 94722 1 6 5 94733 2 1 6 9474@end group 9475@end example 9476 9477@node Multi-scanning, Array Efficiency, Multi-dimensional, Arrays 9478@section Scanning Multi-dimensional Arrays 9479 9480There is no special @code{for} statement for scanning a 9481``multi-dimensional'' array; there cannot be one, because in truth there 9482are no multi-dimensional arrays or elements; there is only a 9483multi-dimensional @emph{way of accessing} an array. 9484 9485However, if your program has an array that is always accessed as 9486multi-dimensional, you can get the effect of scanning it by combining 9487the scanning @code{for} statement 9488(@pxref{Scanning an Array, ,Scanning All Elements of an Array}) with the 9489@code{split} built-in function 9490(@pxref{String Functions, ,Built-in Functions for String Manipulation}). 9491It works like this: 9492 9493@example 9494for (combined in array) @{ 9495 split(combined, separate, SUBSEP) 9496 @dots{} 9497@} 9498@end example 9499 9500@noindent 9501This sets @code{combined} to 9502each concatenated, combined index in the array, and splits it 9503into the individual indices by breaking it apart where the value of 9504@code{SUBSEP} appears. The split-out indices become the elements of 9505the array @code{separate}. 
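
As a complete, if tiny, sketch of this technique (the stored value
@code{"x"} and the variable @code{n} are assumptions made for
illustration):

@example
@group
awk 'BEGIN @{
    array[1, "foo"] = "x"
    for (combined in array) @{
        # break the combined index apart at SUBSEP
        n = split(combined, separate, SUBSEP)
        print n, separate[1], separate[2]
    @}
@}'
@end group
@end example

@noindent
This prints @samp{2 1 foo}: two indices were recovered, @code{"1"} and
@code{"foo"}.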

Thus, suppose you have previously stored a value in @code{array[1, "foo"]};
then an element with index @code{"1\034foo"} exists in
@code{array}. (Recall that the default value of @code{SUBSEP} is
the character with octal code 034.) Sooner or later the @code{for} statement
will find that index and do an iteration with @code{combined} set to
@code{"1\034foo"}. Then the @code{split} function is called as
follows:

@example
split("1\034foo", separate, "\034")
@end example

@noindent
The result of this is to set @code{separate[1]} to @code{"1"} and
@code{separate[2]} to @code{"foo"}. Presto, the original sequence of
separate indices has been recovered.

@node Array Efficiency, , Multi-scanning, Arrays
@section Using Array Memory Efficiently

This section applies just to @code{gawk}.

It is often useful to use the same bit of data as an index
into multiple arrays.
Due to the way @code{gawk} implements associative arrays,
when you need to use input data as an index for multiple
arrays, it is much more efficient to assign the input field
to a separate variable, and then use that variable as the index.

@example
@{
    name = $1
    ssn = $2
    nkids = $3
    @dots{}
    seniority[name]++   # better than seniority[$1]++
    kids[name] = nkids  # better than kids[$1] = nkids
@}
@end example

Using separate variables with mnemonic names for the input fields
makes programs more readable, in any case.
It is an eventual goal to make @code{gawk}'s array indexing as efficient
as possible, no matter what the source of the index value.

@node Built-in, User-defined, Arrays, Top
@chapter Built-in Functions

@c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!!
@cindex built-in functions
@dfn{Built-in} functions are functions that are always available for
your @code{awk} program to call.
This chapter defines all the built-in 9559functions in @code{awk}; some of them are mentioned in other sections, 9560but they are summarized here for your convenience. (You can also define 9561new functions yourself. @xref{User-defined, ,User-defined Functions}.) 9562 9563@menu 9564* Calling Built-in:: How to call built-in functions. 9565* Numeric Functions:: Functions that work with numbers, including 9566 @code{int}, @code{sin} and @code{rand}. 9567* String Functions:: Functions for string manipulation, such as 9568 @code{split}, @code{match}, and 9569 @code{sprintf}. 9570* I/O Functions:: Functions for files and shell commands. 9571* Time Functions:: Functions for dealing with time stamps. 9572@end menu 9573 9574@node Calling Built-in, Numeric Functions, Built-in, Built-in 9575@section Calling Built-in Functions 9576 9577To call a built-in function, write the name of the function followed 9578by arguments in parentheses. For example, @samp{atan2(y + z, 1)} 9579is a call to the function @code{atan2}, with two arguments. 9580 9581Whitespace is ignored between the built-in function name and the 9582open-parenthesis, but we recommend that you avoid using whitespace 9583there. User-defined functions do not permit whitespace in this way, and 9584you will find it easier to avoid mistakes by following a simple 9585convention which always works: no whitespace after a function name. 9586 9587@cindex differences between @code{gawk} and @code{awk} 9588Each built-in function accepts a certain number of arguments. 9589In some cases, arguments can be omitted. The defaults for omitted 9590arguments vary from function to function and are described under the 9591individual functions. In some @code{awk} implementations, extra 9592arguments given to built-in functions are ignored. However, in @code{gawk}, 9593it is a fatal error to give extra arguments to a built-in function. 

When a function is called, expressions that create the function's actual
parameters are evaluated completely before the function call is performed.
For example, in the code fragment:

@example
i = 4
j = sqrt(i++)
@end example

@noindent
the variable @code{i} is set to five before @code{sqrt} is called
with a value of four for its actual parameter.

@cindex evaluation, order of
@cindex order of evaluation
The order of evaluation of the expressions used for the function's
parameters is undefined. Thus, you should not write programs that
assume that parameters are evaluated from left to right or from
right to left. For example,

@example
i = 5
j = atan2(++i, i *= 2)
@end example

If the order of evaluation is left to right, then @code{i} first becomes
six, and then 12, and @code{atan2} is called with the two arguments six
and 12. But if the order of evaluation is right to left, @code{i}
first becomes 10, and then 11, and @code{atan2} is called with the
two arguments 11 and 10.

@node Numeric Functions, String Functions, Calling Built-in, Built-in
@section Numeric Built-in Functions

Here is a full list of built-in functions that work with numbers.
Optional parameters are enclosed in square brackets (``['' and ``]'').

@table @code
@item int(@var{x})
@findex int
This truncates @var{x} toward zero, producing the nearest integer
that lies between @var{x} and zero.

For example, @code{int(3)} is three, @code{int(3.9)} is three, @code{int(-3.9)}
is @minus{}3, and @code{int(-3)} is @minus{}3 as well.

@item sqrt(@var{x})
@findex sqrt
This gives you the positive square root of @var{x}. It reports an error
if @var{x} is negative. Thus, @code{sqrt(4)} is two.

@item exp(@var{x})
@findex exp
This gives you the exponential of @var{x} (@code{e ^ @var{x}}), or reports
an error if @var{x} is out of range. The range of values @var{x} can have
depends on your machine's floating point representation.

@item log(@var{x})
@findex log
This gives you the natural logarithm of @var{x}, if @var{x} is positive;
otherwise, it reports an error.

@item sin(@var{x})
@findex sin
This gives you the sine of @var{x}, with @var{x} in radians.

@item cos(@var{x})
@findex cos
This gives you the cosine of @var{x}, with @var{x} in radians.

@item atan2(@var{y}, @var{x})
@findex atan2
This gives you the arctangent of @code{@var{y} / @var{x}} in radians.

@item rand()
@findex rand
This gives you a random number. The values of @code{rand} are
uniformly distributed between zero and one.
The value could be zero but is never one.

Often you want random integers instead. Here is a user-defined function
you can use to obtain a random non-negative integer less than @var{n}:

@example
function randint(n) @{
     return int(n * rand())
@}
@end example

@noindent
The multiplication produces a random number greater than or equal to
zero and less than @code{n}. We then make it an integer (using
@code{int}) between zero and @code{n} @minus{} 1, inclusive.

Here is an example where a similar function is used to produce
random integers between one and @var{n}. This program
prints a new random number for each input record.

@example
@group
awk '
# Function to roll a simulated die.
function roll(n) @{ return 1 + int(rand() * n) @}
@end group

@group
# Roll 3 six-sided dice and
# print total number of points.
9703@{ 9704 printf("%d points\n", 9705 roll(6)+roll(6)+roll(6)) 9706@}' 9707@end group 9708@end example 9709 9710@cindex seed for random numbers 9711@cindex random numbers, seed of 9712@comment MAWK uses a different seed each time. 9713@strong{Caution:} In most @code{awk} implementations, including @code{gawk}, 9714@code{rand} starts generating numbers from the same 9715starting number, or @dfn{seed}, each time you run @code{awk}. Thus, 9716a program will generate the same results each time you run it. 9717The numbers are random within one @code{awk} run, but predictable 9718from run to run. This is convenient for debugging, but if you want 9719a program to do different things each time it is used, you must change 9720the seed to a value that will be different in each run. To do this, 9721use @code{srand}. 9722 9723@item srand(@r{[}@var{x}@r{]}) 9724@findex srand 9725The function @code{srand} sets the starting point, or seed, 9726for generating random numbers to the value @var{x}. 9727 9728Each seed value leads to a particular sequence of random 9729numbers.@footnote{Computer generated random numbers really are not truly 9730random. They are technically known as ``pseudo-random.'' This means 9731that while the numbers in a sequence appear to be random, you can in 9732fact generate the same sequence of random numbers over and over again.} 9733Thus, if you set the seed to the same value a second time, you will get 9734the same sequence of random numbers again. 9735 9736If you omit the argument @var{x}, as in @code{srand()}, then the current 9737date and time of day are used for a seed. This is the way to get random 9738numbers that are truly unpredictable. 9739 9740The return value of @code{srand} is the previous seed. This makes it 9741easy to keep track of the seeds for use in consistently reproducing 9742sequences of random numbers. 
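
The following sketch illustrates both points; the seed values 42 and 7
are arbitrary choices made for illustration.

@example
@group
awk 'BEGIN @{
    srand(42); a = rand()
    srand(42); b = rand()    # same seed, so the same first number
    if (a == b)
        print "same sequence"
    print srand(7)           # returns the previous seed
@}'
@end group
@end example

@noindent
This prints @samp{same sequence} and then @samp{42}.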
9743@end table 9744 9745@node String Functions, I/O Functions, Numeric Functions, Built-in 9746@section Built-in Functions for String Manipulation 9747 9748The functions in this section look at or change the text of one or more 9749strings. 9750Optional parameters are enclosed in square brackets (``['' and ``]''). 9751 9752@table @code 9753@item index(@var{in}, @var{find}) 9754@findex index 9755This searches the string @var{in} for the first occurrence of the string 9756@var{find}, and returns the position in characters where that occurrence 9757begins in the string @var{in}. For example: 9758 9759@example 9760$ awk 'BEGIN @{ print index("peanut", "an") @}' 9761@print{} 3 9762@end example 9763 9764@noindent 9765If @var{find} is not found, @code{index} returns zero. 9766(Remember that string indices in @code{awk} start at one.) 9767 9768@item length(@r{[}@var{string}@r{]}) 9769@findex length 9770This gives you the number of characters in @var{string}. If 9771@var{string} is a number, the length of the digit string representing 9772that number is returned. For example, @code{length("abcde")} is five. By 9773contrast, @code{length(15 * 35)} works out to three. How? Well, 15 * 35 = 9774525, and 525 is then converted to the string @code{"525"}, which has 9775three characters. 9776 9777If no argument is supplied, @code{length} returns the length of @code{$0}. 9778 9779@cindex historical features 9780@cindex portability issues 9781@cindex @code{awk} language, POSIX version 9782@cindex POSIX @code{awk} 9783In older versions of @code{awk}, you could call the @code{length} function 9784without any parentheses. Doing so is marked as ``deprecated'' in the 9785POSIX standard. This means that while you can do this in your 9786programs, it is a feature that can eventually be removed from a future 9787version of the standard. Therefore, for maximal portability of your 9788@code{awk} programs, you should always supply the parentheses. 

@item match(@var{string}, @var{regexp})
@findex match
The @code{match} function searches the string, @var{string}, for the
longest, leftmost substring matched by the regular expression,
@var{regexp}. It returns the character position, or @dfn{index}, of
where that substring begins (one, if it starts at the beginning of
@var{string}). If no match is found, it returns zero.

@vindex RSTART
@vindex RLENGTH
The @code{match} function sets the built-in variable @code{RSTART} to
the index. It also sets the built-in variable @code{RLENGTH} to the
length in characters of the matched substring. If no match is found,
@code{RSTART} is set to zero, and @code{RLENGTH} to @minus{}1.

For example:

@example
@group
@c file eg/misc/findpat.sh
awk '@{
       if ($1 == "FIND")
         regex = $2
       else @{
         where = match($0, regex)
         if (where != 0)
           print "Match of", regex, "found at", \
                     where, "in", $0
       @}
@}'
@c endfile
@end group
@end example

@noindent
This program looks for lines that match the regular expression stored in
the variable @code{regex}. This regular expression can be changed. If the
first word on a line is @samp{FIND}, @code{regex} is changed to be the
second word on that line. Therefore, given:

@example
@c file eg/misc/findpat.data
FIND ru+n
My program runs
but not very quickly
FIND Melvin
JF+KM
This line is property of Reality Engineering Co.
Melvin was here.
@c endfile
@end example

@noindent
@code{awk} prints:

@example
Match of ru+n found at 12 in My program runs
Match of Melvin found at 1 in Melvin was here.
@end example

@item split(@var{string}, @var{array} @r{[}, @var{fieldsep}@r{]})
@findex split
This divides @var{string} into pieces separated by @var{fieldsep},
and stores the pieces in @var{array}.
The first piece is stored in
@code{@var{array}[1]}, the second piece in @code{@var{array}[2]}, and so
forth. The string value of the third argument, @var{fieldsep}, is
a regexp describing where to split @var{string} (much as @code{FS} can
be a regexp describing where to split input records). If
the @var{fieldsep} is omitted, the value of @code{FS} is used.
@code{split} returns the number of elements created.

The @code{split} function splits strings into pieces in a
manner similar to the way input lines are split into fields. For example:

@example
split("cul-de-sac", a, "-")
@end example

@noindent
splits the string @samp{cul-de-sac} into three fields using @samp{-} as the
separator. It sets the contents of the array @code{a} as follows:

@example
a[1] = "cul"
a[2] = "de"
a[3] = "sac"
@end example

@noindent
The value returned by this call to @code{split} is three.

As with input field-splitting, when the value of @var{fieldsep} is
@w{@code{" "}}, leading and trailing whitespace is ignored, and the elements
are separated by runs of whitespace.

@cindex differences between @code{gawk} and @code{awk}
Also as with input field-splitting, if @var{fieldsep} is the null string, each
individual character in the string is split into its own array element.
(This is a @code{gawk}-specific extension.)

@cindex dark corner
Recent implementations of @code{awk}, including @code{gawk}, allow
the third argument to be a regexp constant (@code{/abc/}), as well as a
string (d.c.). The POSIX standard allows this as well.

Before splitting the string, @code{split} deletes any previously existing
elements in the array @var{array} (d.c.).

If @var{string} does not match @var{fieldsep} at all, @var{array} will have
one element. The value of that element will be the original
@var{string}.
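
A short sketch of the points above: the return value of @code{split}
drives a loop over the pieces, old array elements are discarded, and a
string with no separator yields a single element. The date string, the
array names and the shell variables are illustrative assumptions.

```shell
# Use the return value of split as the loop bound over the pieces.
out=$(awk 'BEGIN {
    n = split("2000-07-15", parts, "-")
    for (i = 1; i <= n; i++)
        print i, parts[i]
}')
echo "$out"

# split first clears the array; with no separator present it stores
# the whole string in element 1 and returns 1.
single=$(awk 'BEGIN {
    a[9] = "stale"
    n = split("no commas here", a, ",")
    print n, (9 in a), a[1]
}')
echo "$single"
```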

@item sprintf(@var{format}, @var{expression1},@dots{})
@findex sprintf
This returns (without printing) the string that @code{printf} would
have printed out with the same arguments
(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).
For example:

@example
sprintf("pi = %.2f (approx.)", 22/7)
@end example

@noindent
returns the string @w{@code{"pi = 3.14 (approx.)"}}.

@ignore
2e: For sub, gsub, and gensub, either here or in the "how much matches"
    section, we need some explanation that it is possible to match the
    null string when using closures like *. E.g.,

        $ echo abc | awk '{ gsub(/m*/, "X"); print }'
        @print{} XaXbXcX

    Although this makes a certain amount of sense, it can be very
    surprising.
@end ignore

@item sub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]})
@findex sub
The @code{sub} function alters the value of @var{target}.
It searches this value, which is treated as a string, for the
leftmost longest substring matched by the regular expression, @var{regexp},
extending this match as far as possible. Then the entire string is
changed by replacing the matched text with @var{replacement}.
The modified string becomes the new value of @var{target}.

This function is peculiar because @var{target} is not simply
used to compute a value, and not just any expression will do: it
must be a variable, field or array element, so that @code{sub} can
store a modified value there. If this argument is omitted, then the
default is to use and alter @code{$0}.

For example:

@example
str = "water, water, everywhere"
sub(/at/, "ith", str)
@end example

@noindent
sets @code{str} to @w{@code{"wither, water, everywhere"}}, by replacing the
leftmost, longest occurrence of @samp{at} with @samp{ith}.

The @code{sub} function returns the number of substitutions made (either
one or zero).

If the special character @samp{&} appears in @var{replacement}, it
stands for the precise substring that was matched by @var{regexp}. (If
the regexp can match more than one string, then this precise substring
may vary.) For example:

@example
awk '@{ sub(/candidate/, "& and his wife"); print @}'
@end example

@noindent
changes the first occurrence of @samp{candidate} to @samp{candidate
and his wife} on each input line.

Here is another example:

@example
awk 'BEGIN @{
        str = "daabaaa"
        sub(/a+/, "C&C", str)
        print str
@}'
@print{} dCaaCbaaa
@end example

@noindent
This shows how @samp{&} can represent a non-constant string, and also
illustrates the ``leftmost, longest'' rule in regexp matching
(@pxref{Leftmost Longest, ,How Much Text Matches?}).

The effect of this special character (@samp{&}) can be turned off by putting a
backslash before it in the string. As usual, to insert one backslash in
the string, you must write two backslashes. Therefore, write @samp{\\&}
in a string constant to include a literal @samp{&} in the replacement.
For example, here is how to replace the first @samp{|} on each line with
an @samp{&}:

@example
awk '@{ sub(/\|/, "\\&"); print @}'
@end example

@cindex @code{sub}, third argument of
@cindex @code{gsub}, third argument of
@strong{Note:} As mentioned above, the third argument to @code{sub} must
be a variable, field or array reference.
Some versions of @code{awk} allow the third argument to
be an expression which is not an lvalue. In such a case, @code{sub}
would still search for the pattern and return zero or one, but the result of
the substitution (if any) would be thrown away because there is no place
to put it.
Such versions of @code{awk} accept expressions like
this:

@example
sub(/USA/, "United States", "the USA and Canada")
@end example

@noindent
For historical compatibility, @code{gawk} will accept erroneous code,
such as in the above example. However, using any other non-changeable
object as the third parameter will cause a fatal error, and your program
will not run.

Finally, if the @var{regexp} is not a regexp constant, it is converted into a
string and then the value of that string is treated as the regexp to match.

@item gsub(@var{regexp}, @var{replacement} @r{[}, @var{target}@r{]})
@findex gsub
This is similar to the @code{sub} function, except @code{gsub} replaces
@emph{all} of the longest, leftmost, @emph{non-overlapping} matching
substrings it can find. The @samp{g} in @code{gsub} stands for
``global,'' which means replace everywhere. For example:

@example
awk '@{ gsub(/Britain/, "United Kingdom"); print @}'
@end example

@noindent
replaces all occurrences of the string @samp{Britain} with @samp{United
Kingdom} for all input records.

The @code{gsub} function returns the number of substitutions made. If
the variable to be searched and altered, @var{target}, is
omitted, then the entire input record, @code{$0}, is used.

As in @code{sub}, the characters @samp{&} and @samp{\} are special,
and the third argument must be an lvalue.
@end table

@table @code
@item gensub(@var{regexp}, @var{replacement}, @var{how} @r{[}, @var{target}@r{]})
@findex gensub
@code{gensub} is a general substitution function. Like @code{sub} and
@code{gsub}, it searches the target string @var{target} for matches of
the regular expression @var{regexp}.
Unlike @code{sub} and
@code{gsub}, the modified string is returned as the result of the
function, and the original target string is @emph{not} changed. If
@var{how} is a string beginning with @samp{g} or @samp{G}, then it
replaces all matches of @var{regexp} with @var{replacement}.
Otherwise, @var{how} is a number indicating which match of @var{regexp}
to replace. If no @var{target} is supplied, @code{$0} is used instead.

@code{gensub} provides an additional feature that is not available
in @code{sub} or @code{gsub}: the ability to specify components of
a regexp in the replacement text. This is done by using parentheses
in the regexp to mark the components, and then specifying @samp{\@var{n}}
in the replacement text, where @var{n} is a digit from one to nine.
For example:

@example
@group
$ gawk '
> BEGIN @{
>      a = "abc def"
>      b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
>      print b
> @}'
@print{} def abc
@end group
@end example

@noindent
As described above for @code{sub}, you must type two backslashes in order
to get one into the string.

In the replacement text, the sequence @samp{\0} represents the entire
matched text, as does the character @samp{&}.

This example shows how you can use the third argument to control
which match of the regexp should be changed.

@example
$ echo a b c a b c |
> gawk '@{ print gensub(/a/, "AA", 2) @}'
@print{} a b c AA b c
@end example

In this case, @code{$0} is used as the default target string.
@code{gensub} returns the new string as its result, which is
passed directly to @code{print} for printing.

If the @var{how} argument is a string that does not begin with @samp{g} or
@samp{G}, or if it is a number that is less than zero, only one
substitution is performed.

If @var{regexp} does not match @var{target}, @code{gensub}'s return value
is the original, unchanged value of @var{target}.

@cindex differences between @code{gawk} and @code{awk}
@code{gensub} is a @code{gawk} extension; it is not available
in compatibility mode (@pxref{Options, ,Command Line Options}).

@item substr(@var{string}, @var{start} @r{[}, @var{length}@r{]})
@findex substr
This returns a @var{length}-character-long substring of @var{string},
starting at character number @var{start}. The first character of a
string is character number one. For example,
@code{substr("washington", 5, 3)} returns @code{"ing"}.

If @var{length} is not present, this function returns the whole suffix of
@var{string} that begins at character number @var{start}. For example,
@code{substr("washington", 5)} returns @code{"ington"}. The whole
suffix is also returned
if @var{length} is greater than the number of characters remaining
in the string, counting from character number @var{start}.

@strong{Note:} The string returned by @code{substr} @emph{cannot} be
assigned to. Thus, it is a mistake to attempt to change a portion of
a string, like this:

@example
string = "abcdef"
# try to get "abCDEf", won't work
substr(string, 3, 3) = "CDE"
@end example

@noindent
or to use @code{substr} as the third argument of @code{sub} or @code{gsub}:

@example
gsub(/xyz/, "pdq", substr($0, 5, 20))  # WRONG
@end example

@cindex case conversion
@cindex conversion of case
@item tolower(@var{string})
@findex tolower
This returns a copy of @var{string}, with each upper-case character
in the string replaced with its corresponding lower-case character.
Non-alphabetic characters are left unchanged. For example,
@code{tolower("MiXeD cAsE 123")} returns @code{"mixed case 123"}.
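
Returning to the @code{substr} note above: one possible way to get the
effect of assigning to part of a string is to rebuild the string by
concatenating the pieces around the replaced portion. This is only a
sketch; the shell variable name @code{fixed} is an assumption made for
the example.

```shell
# substr(...) cannot appear on the left of an assignment, so build a new
# string from the prefix, the replacement text, and the suffix instead.
fixed=$(awk 'BEGIN {
    string = "abcdef"
    string = substr(string, 1, 2) "CDE" substr(string, 6)
    print string
}')
echo "$fixed"
```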

@item toupper(@var{string})
@findex toupper
This returns a copy of @var{string}, with each lower-case character
in the string replaced with its corresponding upper-case character.
Non-alphabetic characters are left unchanged. For example,
@code{toupper("MiXeD cAsE 123")} returns @code{"MIXED CASE 123"}.
@end table

@c fakenode --- for prepinfo
@subheading More About @samp{\} and @samp{&} with @code{sub}, @code{gsub} and @code{gensub}

@cindex escape processing, @code{sub} et al.
When using @code{sub}, @code{gsub} or @code{gensub}, and trying to get literal
backslashes and ampersands into the replacement text, you need to remember
that there are several levels of @dfn{escape processing} going on.

First, there is the @dfn{lexical} level, which is when @code{awk} reads
your program, and builds an internal copy of your program that can
be executed.

Then there is the run-time level, when @code{awk} actually scans the
replacement string to determine what to generate.

At both levels, @code{awk} looks for a defined set of characters that
can come after a backslash. At the lexical level, it looks for the
escape sequences listed in @ref{Escape Sequences}.
Thus, for every @samp{\} that @code{awk} will process at the run-time
level, you type two @samp{\}s at the lexical level.
When a character that is not valid for an escape sequence follows the
@samp{\}, Unix @code{awk} and @code{gawk} both simply remove the initial
@samp{\}, and put the following character into the string. Thus, for
example, @code{"a\qb"} is treated as @code{"aqb"}.

At the run-time level, the various functions handle sequences of
@samp{\} and @samp{&} differently. The situation is (sadly) somewhat complex.
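
Before the details, here is a minimal sketch of the two levels for the
simplest case only: a bare @samp{&} versus @samp{\\&} typed in a string
constant. The input text and shell variable names are invented for
illustration, and the sketch deliberately avoids the longer backslash
sequences whose treatment differs between implementations.

```shell
# Run-time level: an unescaped & stands for the matched text.
matched=$(echo "a cat" | awk '{ sub(/cat/, "[&]"); print }')

# Lexical level: "\\&" in the program becomes \& in the string, which
# sub then treats as a literal ampersand.
literal=$(echo "a cat" | awk '{ sub(/cat/, "\\&"); print }')

echo "$matched"
echo "$literal"
```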

Historically, the @code{sub} and @code{gsub} functions treated the
two-character sequence @samp{\&} specially; this sequence was replaced in
the generated text with a single @samp{&}. Any other @samp{\} within
the @var{replacement} string that did not precede an @samp{&} was passed
through unchanged. To illustrate with a table:

@c Thanks to Karl Berry for help with the TeX stuff.
@tex
\vbox{\bigskip
% This table has lots of &'s and \'s, so unspecialize them.
\catcode`\& = \other \catcode`\\ = \other
% But then we need character for escape and tab.
@catcode`! = 4
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
     You type!@code{sub} sees!@code{sub} generates@cr
@hrulefill!@hrulefill!@hrulefill@cr
     @code{\&}!       @code{&}!the matched text@cr
    @code{\\&}!      @code{\&}!a literal @samp{&}@cr
   @code{\\\&}!      @code{\&}!a literal @samp{&}@cr
  @code{\\\\&}!     @code{\\&}!a literal @samp{\&}@cr
 @code{\\\\\&}!     @code{\\&}!a literal @samp{\&}@cr
@code{\\\\\\&}!    @code{\\\&}!a literal @samp{\\&}@cr
    @code{\\q}!      @code{\q}!a literal @samp{\q}@cr
}
@bigskip}
@end tex
@ifinfo
@display
 You type         @code{sub} sees          @code{sub} generates
 --------         ----------          ---------------
     @code{\&}              @code{&}            the matched text
    @code{\\&}             @code{\&}            a literal @samp{&}
   @code{\\\&}             @code{\&}            a literal @samp{&}
  @code{\\\\&}            @code{\\&}            a literal @samp{\&}
 @code{\\\\\&}            @code{\\&}            a literal @samp{\&}
@code{\\\\\\&}           @code{\\\&}            a literal @samp{\\&}
    @code{\\q}             @code{\q}            a literal @samp{\q}
@end display
@end ifinfo

@noindent
This table shows both the lexical level processing, where
an odd number of backslashes becomes an even number at the run time level,
and the run-time processing done by @code{sub}.
(For the sake of simplicity, the rest of the tables below only show the
case of even numbers of @samp{\}s entered at the lexical level.)

The problem with the historical approach is that there is no way to get
a literal @samp{\} followed by the matched text.

@cindex @code{awk} language, POSIX version
@cindex POSIX @code{awk}
The 1992 POSIX standard attempted to fix this problem. The standard
says that @code{sub} and @code{gsub} look for either a @samp{\} or an @samp{&}
after the @samp{\}. If either one follows a @samp{\}, that character is
output literally. The interpretation of @samp{\} and @samp{&} then becomes
like this:

@c thanks to Karl Berry for formatting this table
@tex
\vbox{\bigskip
% This table has lots of &'s and \'s, so unspecialize them.
\catcode`\& = \other \catcode`\\ = \other
% But then we need character for escape and tab.
@catcode`! = 4
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
     You type!@code{sub} sees!@code{sub} generates@cr
@hrulefill!@hrulefill!@hrulefill@cr
      @code{&}!       @code{&}!the matched text@cr
    @code{\\&}!      @code{\&}!a literal @samp{&}@cr
  @code{\\\\&}!     @code{\\&}!a literal @samp{\}, then the matched text@cr
@code{\\\\\\&}!    @code{\\\&}!a literal @samp{\&}@cr
}
@bigskip}
@end tex
@ifinfo
@display
 You type         @code{sub} sees          @code{sub} generates
 --------         ----------          ---------------
      @code{&}              @code{&}            the matched text
    @code{\\&}             @code{\&}            a literal @samp{&}
  @code{\\\\&}            @code{\\&}            a literal @samp{\}, then the matched text
@code{\\\\\\&}           @code{\\\&}            a literal @samp{\&}
@end display
@end ifinfo

@noindent
This would appear to solve the problem.
Unfortunately, the phrasing of the standard is unusual.
It
says, in effect, that @samp{\} turns off the special meaning of any
following character, but that for anything other than @samp{\} and @samp{&},
such special meaning is undefined. This wording leads to two problems.

@enumerate
@item
Backslashes must now be doubled in the @var{replacement} string, breaking
historical @code{awk} programs.

@item
To make sure that an @code{awk} program is portable, @emph{every} character
in the @var{replacement} string must be preceded with a
backslash.@footnote{This consequence was certainly unintended.}
@c I can say that, 'cause I was involved in making this change
@end enumerate

The POSIX standard is under revision.@footnote{As of @value{UPDATE-MONTH},
with final approval and publication as part of the Austin Group
Standards hopefully sometime in 2001.}
Because of the above problems, proposed text for the revised standard
reverts to rules that correspond more closely to the original existing
practice. The proposed rules have special cases that make it possible
to produce a @samp{\} preceding the matched text.

@tex
\vbox{\bigskip
% This table has lots of &'s and \'s, so unspecialize them.
\catcode`\& = \other \catcode`\\ = \other
% But then we need character for escape and tab.
@catcode`! = 4
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
     You type!@code{sub} sees!@code{sub} generates@cr
@hrulefill!@hrulefill!@hrulefill@cr
@code{\\\\\\&}!    @code{\\\&}!a literal @samp{\&}@cr
  @code{\\\\&}!     @code{\\&}!a literal @samp{\}, followed by the matched text@cr
    @code{\\&}!      @code{\&}!a literal @samp{&}@cr
    @code{\\q}!
      @code{\q}!a literal @samp{\q}@cr
}
@bigskip}
@end tex
@ifinfo
@display
 You type         @code{sub} sees          @code{sub} generates
 --------         ----------          ---------------
@code{\\\\\\&}           @code{\\\&}            a literal @samp{\&}
  @code{\\\\&}            @code{\\&}            a literal @samp{\}, followed by the matched text
    @code{\\&}             @code{\&}            a literal @samp{&}
    @code{\\q}             @code{\q}            a literal @samp{\q}
@end display
@end ifinfo

In a nutshell, at the run-time level, there are now three special sequences
of characters, @samp{\\\&}, @samp{\\&} and @samp{\&}, whereas historically,
there was only one. However, as in the historical case, any @samp{\} that
is not part of one of these three sequences is not special, and appears
in the output literally.

@code{gawk} 3.0 follows these proposed POSIX rules for @code{sub} and
@code{gsub}.
@c As much as we think it's a lousy idea. You win some, you lose some. Sigh.
Whether these proposed rules will actually become codified into the
standard is unknown at this point. Subsequent @code{gawk} releases will
track the standard and implement whatever the final version specifies;
this @value{DOCUMENT} will be updated as well.

The rules for @code{gensub} are considerably simpler. At the run-time
level, whenever @code{gawk} sees a @samp{\}, if the following character
is a digit, then the text that matched the corresponding parenthesized
subexpression is placed in the generated output. Otherwise,
no matter what the character after the @samp{\} is, that character will
appear in the generated text, and the @samp{\} will not.

@tex
\vbox{\bigskip
% This table has lots of &'s and \'s, so unspecialize them.
\catcode`\& = \other \catcode`\\ = \other
% But then we need character for escape and tab.
@catcode`!
= 4
@halign{@hfil#!@qquad@hfil#!@qquad#@hfil@cr
     You type!@code{gensub} sees!@code{gensub} generates@cr
@hrulefill!@hrulefill!@hrulefill@cr
      @code{&}!           @code{&}!the matched text@cr
    @code{\\&}!          @code{\&}!a literal @samp{&}@cr
   @code{\\\\}!          @code{\\}!a literal @samp{\}@cr
  @code{\\\\&}!         @code{\\&}!a literal @samp{\}, then the matched text@cr
@code{\\\\\\&}!        @code{\\\&}!a literal @samp{\&}@cr
    @code{\\q}!          @code{\q}!a literal @samp{q}@cr
}
@bigskip}
@end tex
@ifinfo
@display
  You type          @code{gensub} sees         @code{gensub} generates
  --------          -------------         ------------------
      @code{&}                    @code{&}            the matched text
    @code{\\&}                   @code{\&}            a literal @samp{&}
   @code{\\\\}                   @code{\\}            a literal @samp{\}
  @code{\\\\&}                  @code{\\&}            a literal @samp{\}, then the matched text
@code{\\\\\\&}                 @code{\\\&}            a literal @samp{\&}
    @code{\\q}                   @code{\q}            a literal @samp{q}
@end display
@end ifinfo

Because of the complexity of the lexical and run-time level processing,
and the special cases for @code{sub} and @code{gsub},
we recommend the use of @code{gawk} and @code{gensub} when you have
to do substitutions.

@node I/O Functions, Time Functions, String Functions, Built-in
@section Built-in Functions for Input/Output

The following functions are related to Input/Output (I/O).
Optional parameters are enclosed in square brackets (``['' and ``]'').

@table @code
@item close(@var{filename})
@findex close
Close the file @var{filename}, for input or output. The argument may
alternatively be a shell command that was used for redirecting to or
from a pipe; then the pipe is closed.
@xref{Close Files And Pipes, ,Closing Input and Output Files and Pipes},
for more information.
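
As a small sketch of closing a pipe, the program below sends two lines
to @code{sort} and closes the pipe before printing anything else, so
that the sorted lines appear first. It assumes a @code{sort} utility is
available; the variable names @code{cmd} and @code{out} are chosen for
the example.

```shell
# close(cmd) must be given the same string that was used to open the pipe.
# Closing makes sort see end-of-input and waits for it to finish.
out=$(awk 'BEGIN {
    cmd = "sort"
    print "pear"  | cmd
    print "apple" | cmd
    close(cmd)
    print "done"
}')
echo "$out"
```

Storing the command in a variable, as done here, is one simple way to
make sure the string passed to @code{close} matches exactly.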

@item fflush(@r{[}@var{filename}@r{]})
@findex fflush
@cindex portability issues
@cindex flushing buffers
@cindex buffers, flushing
@cindex buffering output
@cindex output, buffering
Flush any buffered output associated with @var{filename}, which is either a
file opened for writing, or a shell command for redirecting output to
a pipe.

Many utility programs will @dfn{buffer} their output; they save information
to be written to a disk file or terminal in memory, until there is enough
for it to be worthwhile to send the data to the output device.
This is often more efficient than writing
every little bit of information as soon as it is ready. However, sometimes
it is necessary to force a program to @dfn{flush} its buffers; that is,
write the information to its destination, even if a buffer is not full.
This is the purpose of the @code{fflush} function; @code{gawk} too
buffers its output, and the @code{fflush} function can be used to force
@code{gawk} to flush its buffers.

@code{fflush} is a recent (1994) addition to the Bell Labs research
version of @code{awk}; it is not part of the POSIX standard, and will
not be available if @samp{--posix} has been specified on the command
line (@pxref{Options, ,Command Line Options}).

@code{gawk} extends the @code{fflush} function in two ways. The first
is to allow no argument at all. In this case, the buffer for the
standard output is flushed. The second way is to allow the null string
(@w{@code{""}}) as the argument. In this case, the buffers for
@emph{all} open output files and pipes are flushed.

@code{fflush} returns zero if the buffer was successfully flushed,
and nonzero otherwise.
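
The following sketch uses the no-argument form of @code{fflush} and
checks its return value. It assumes an @code{awk} that provides
@code{fflush} and accepts the no-argument form (as @code{gawk} does);
the message text and variable names are invented for the example.

```shell
# Flush the partial line written by printf, then report on the status
# that fflush returned (zero means success).
out=$(awk 'BEGIN {
    printf "progress: "
    status = fflush()
    print (status == 0 ? "flushed" : "flush failed")
}')
echo "$out"
```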

@item system(@var{command})
@findex system
@cindex interaction, @code{awk} and other programs
The @code{system} function allows the user to execute operating system commands
and then return to the @code{awk} program. The @code{system} function
executes the command given by the string @var{command}. It returns, as
its value, the status returned by the command that was executed.

For example, if the following fragment of code is put in your @code{awk}
program:

@example
END @{
     system("date | mail -s 'awk run done' root")
@}
@end example

@noindent
the system administrator will be sent mail when the @code{awk} program
finishes processing input and begins its end-of-input processing.

Note that redirecting @code{print} or @code{printf} into a pipe is often
enough to accomplish your task. If you need to run many commands, it
will be more efficient to simply print them to a pipe to the shell:

@example
while (@var{more stuff to do})
    print @var{command} | "/bin/sh"
close("/bin/sh")
@end example

@noindent
However, if your @code{awk}
program is interactive, @code{system} is useful for cranking up large
self-contained programs, such as a shell or an editor.

Some operating systems cannot implement the @code{system} function.
@code{system} causes a fatal error if it is not supported.
@end table

@c fakenode --- for prepinfo
@subheading Interactive vs. Non-Interactive Buffering
@cindex buffering, interactive vs. non-interactive
@cindex buffering, non-interactive vs. interactive
@cindex interactive buffering vs. non-interactive
@cindex non-interactive buffering vs.
interactive

As a side point, buffering issues can be even more confusing depending
upon whether or not your program is @dfn{interactive}, i.e., communicating
with a user sitting at a keyboard.@footnote{A program is interactive
if the standard output is connected
to a terminal device.}

Interactive programs generally @dfn{line buffer} their output; they
write out every line. Non-interactive programs wait until they have
a full buffer, which may be many lines of output.

@c Thanks to Walter.Mecky@dresdnerbank.de for this example, and for
@c motivating me to write this section.
Here is an example of the difference.

@example
$ awk '@{ print $1 + $2 @}'
1 1
@print{} 2
2 3
@print{} 5
@kbd{Control-d}
@end example

@noindent
Each line of output is printed immediately. Compare that behavior
with this example.

@example
$ awk '@{ print $1 + $2 @}' | cat
1 1
2 3
@kbd{Control-d}
@print{} 2
@print{} 5
@end example

@noindent
Here, no output is printed until after the @kbd{Control-d} is typed, since
it is all buffered, and sent down the pipe to @code{cat} in one shot.

@c fakenode --- for prepinfo
@subheading Controlling Output Buffering with @code{system}
@cindex flushing buffers
@cindex buffers, flushing
@cindex buffering output
@cindex output, buffering

The @code{fflush} function provides explicit control over output buffering for
individual files and pipes. However, its use is not portable to many other
@code{awk} implementations.
An alternative method to flush output
buffers is by calling @code{system} with a null string as its argument:

@example
system("")   # flush output
@end example

@noindent
@code{gawk} treats this use of the @code{system} function as a special
case, and is smart enough not to run a shell (or other command
interpreter) with the empty command. Therefore, with @code{gawk}, this
idiom is not only useful, it is efficient. While this method should work
with other @code{awk} implementations, it will not necessarily avoid
starting an unnecessary shell. (Other implementations may only
flush the buffer associated with the standard output, and not necessarily
all buffered output.)

If you think about what a programmer expects, it makes sense that
@code{system} should flush any pending output. The following program:

@example
BEGIN @{
     print "first print"
     system("echo system echo")
     print "second print"
@}
@end example

@noindent
must print

@example
first print
system echo
second print
@end example

@noindent
and not

@example
system echo
first print
second print
@end example

If @code{awk} did not flush its buffers before calling @code{system}, the
latter (undesirable) output is what you would see.

@node Time Functions, , I/O Functions, Built-in
@section Functions for Dealing with Time Stamps

@cindex timestamps
@cindex time of day
A common use for @code{awk} programs is the processing of log files
containing time stamp information, indicating when a
particular log record was written. Many programs log their time stamp
in the form returned by the @code{time} system call, which is the
number of seconds since a particular epoch.
On POSIX systems, 10586it is the number of seconds since Midnight, January 1, 1970, UTC. 10587 10588In order to make it easier to process such log files, and to produce 10589useful reports, @code{gawk} provides two functions for working with time 10590stamps. Both of these are @code{gawk} extensions; they are not specified 10591in the POSIX standard, nor are they in any other known version 10592of @code{awk}. 10593 10594Optional parameters are enclosed in square brackets (``['' and ``]''). 10595 10596@table @code 10597@item systime() 10598@findex systime 10599This function returns the current time as the number of seconds since 10600the system epoch. On POSIX systems, this is the number of seconds 10601since Midnight, January 1, 1970, UTC. It may be a different number on 10602other systems. 10603 10604@item strftime(@r{[}@var{format} @r{[}, @var{timestamp}@r{]]}) 10605@findex strftime 10606This function returns a string. It is similar to the function of the 10607same name in ANSI C. The time specified by @var{timestamp} is used to 10608produce a string, based on the contents of the @var{format} string. 10609The @var{timestamp} is in the same format as the value returned by the 10610@code{systime} function. If no @var{timestamp} argument is supplied, 10611@code{gawk} will use the current time of day as the time stamp. 10612If no @var{format} argument is supplied, @code{strftime} uses 10613@code{@w{"%a %b %d %H:%M:%S %Z %Y"}}. This format string produces 10614output (almost) equivalent to that of the @code{date} utility. 10615(Versions of @code{gawk} prior to 3.0 require the @var{format} argument.) 10616@end table 10617 10618The @code{systime} function allows you to compare a time stamp from a 10619log file with the current time of day. In particular, it is easy to 10620determine how long ago a particular record was logged. It also allows 10621you to produce log records using the ``seconds since the epoch'' format. 
10622 10623The @code{strftime} function allows you to easily turn a time stamp 10624into human-readable information. It is similar in nature to the @code{sprintf} 10625function 10626(@pxref{String Functions, ,Built-in Functions for String Manipulation}), 10627in that it copies non-format specification characters verbatim to the 10628returned string, while substituting date and time values for format 10629specifications in the @var{format} string. 10630 10631@code{strftime} is guaranteed by the ANSI C standard to support 10632the following date format specifications: 10633 10634@table @code 10635@item %a 10636The locale's abbreviated weekday name. 10637 10638@item %A 10639The locale's full weekday name. 10640 10641@item %b 10642The locale's abbreviated month name. 10643 10644@item %B 10645The locale's full month name. 10646 10647@item %c 10648The locale's ``appropriate'' date and time representation. 10649 10650@item %d 10651The day of the month as a decimal number (01--31). 10652 10653@item %H 10654The hour (24-hour clock) as a decimal number (00--23). 10655 10656@item %I 10657The hour (12-hour clock) as a decimal number (01--12). 10658 10659@item %j 10660The day of the year as a decimal number (001--366). 10661 10662@item %m 10663The month as a decimal number (01--12). 10664 10665@item %M 10666The minute as a decimal number (00--59). 10667 10668@item %p 10669The locale's equivalent of the AM/PM designations associated 10670with a 12-hour clock. 10671 10672@item %S 10673The second as a decimal number (00--60).@footnote{Occasionally there are 10674minutes in a year with a leap second, which is why the 10675seconds can go up to 60.} 10676 10677@item %U 10678The week number of the year (the first Sunday as the first day of week one) 10679as a decimal number (00--53). 10680 10681@item %w 10682The weekday as a decimal number (0--6). Sunday is day zero. 
10683 10684@item %W 10685The week number of the year (the first Monday as the first day of week one) 10686as a decimal number (00--53). 10687 10688@item %x 10689The locale's ``appropriate'' date representation. 10690 10691@item %X 10692The locale's ``appropriate'' time representation. 10693 10694@item %y 10695The year without century as a decimal number (00--99). 10696 10697@item %Y 10698The year with century as a decimal number (e.g., 1995). 10699 10700@item %Z 10701The time zone name or abbreviation, or no characters if 10702no time zone is determinable. 10703 10704@item %% 10705A literal @samp{%}. 10706@end table 10707 10708If a conversion specifier is not one of the above, the behavior is 10709undefined.@footnote{This is because ANSI C leaves the 10710behavior of the C version of @code{strftime} undefined, and @code{gawk} 10711will use the system's version of @code{strftime} if it's there. 10712Typically, the conversion specifier will either not appear in the 10713returned string, or it will appear literally.} 10714 10715@cindex locale, definition of 10716Informally, a @dfn{locale} is the geographic place in which a program 10717is meant to run. For example, a common way to abbreviate the date 10718September 4, 1991 in the United States would be ``9/4/91''. 10719In many countries in Europe, however, it would be abbreviated ``4.9.91''. 10720Thus, the @samp{%x} specification in a @code{"US"} locale might produce 10721@samp{9/4/91}, while in a @code{"EUROPE"} locale, it might produce 10722@samp{4.9.91}. The ANSI C standard defines a default @code{"C"} 10723locale, which is an environment that is typical of what most C programmers 10724are used to. 10725 10726A public-domain C version of @code{strftime} is supplied with @code{gawk} 10727for systems that are not yet fully ANSI-compliant. 
If that version is 10728used to compile @code{gawk} (@pxref{Installation, ,Installing @code{gawk}}), 10729then the following additional format specifications are available: 10730 10731@table @code 10732@item %D 10733Equivalent to specifying @samp{%m/%d/%y}. 10734 10735@item %e 10736The day of the month, padded with a space if it is only one digit. 10737 10738@item %h 10739Equivalent to @samp{%b}, above. 10740 10741@item %n 10742A newline character (ASCII LF). 10743 10744@item %r 10745Equivalent to specifying @samp{%I:%M:%S %p}. 10746 10747@item %R 10748Equivalent to specifying @samp{%H:%M}. 10749 10750@item %T 10751Equivalent to specifying @samp{%H:%M:%S}. 10752 10753@item %t 10754A tab character. 10755 10756@item %k 10757The hour (24-hour clock) as a decimal number (0-23). 10758Single digit numbers are padded with a space. 10759 10760@item %l 10761The hour (12-hour clock) as a decimal number (1-12). 10762Single digit numbers are padded with a space. 10763 10764@item %C 10765The century, as a number between 00 and 99. 10766 10767@item %u 10768The weekday as a decimal number 10769[1 (Monday)--7]. 10770 10771@cindex ISO 8601 10772@item %V 10773The week number of the year (the first Monday as the first 10774day of week one) as a decimal number (01--53). 10775The method for determining the week number is as specified by ISO 8601 10776(to wit: if the week containing January 1 has four or more days in the 10777new year, then it is week one, otherwise it is week 53 of the previous year 10778and the next week is week one). 10779 10780@item %G 10781The year with century of the ISO week number, as a decimal number. 10782 10783For example, January 1, 1993, is in week 53 of 1992. Thus, the year 10784of its ISO week number is 1992, even though its year is 1993. 10785Similarly, December 31, 1973, is in week 1 of 1974. Thus, the year 10786of its ISO week number is 1974, even though its year is 1973. 
10787 10788@item %g 10789The year without century of the ISO week number, as a decimal number (00--99). 10790 10791@item %Ec %EC %Ex %Ey %EY %Od %Oe %OH %OI 10792@itemx %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy 10793These are ``alternate representations'' for the specifications 10794that use only the second letter (@samp{%c}, @samp{%C}, and so on). 10795They are recognized, but their normal representations are 10796used.@footnote{If you don't understand any of this, don't worry about 10797it; these facilities are meant to make it easier to ``internationalize'' 10798programs.} 10799(These facilitate compliance with the POSIX @code{date} utility.) 10800 10801@item %v 10802The date in VMS format (e.g., 20-JUN-1991). 10803 10804@cindex RFC-822 10805@cindex RFC-1036 10806@item %z 10807The timezone offset in a +HHMM format (e.g., the format necessary to 10808produce RFC-822/RFC-1036 date headers). 10809@end table 10810 10811This example is an @code{awk} implementation of the POSIX 10812@code{date} utility. Normally, the @code{date} utility prints the 10813current date and time of day in a well known format. However, if you 10814provide an argument to it that begins with a @samp{+}, @code{date} 10815will copy non-format specifier characters to the standard output, and 10816will interpret the current time according to the format specifiers in 10817the string. For example: 10818 10819@example 10820$ date '+Today is %A, %B %d, %Y.' 10821@print{} Today is Thursday, July 11, 1991. 10822@end example 10823 10824Here is the @code{gawk} version of the @code{date} utility. 10825It has a shell ``wrapper'', to handle the @samp{-u} option, 10826which requires that @code{date} run as if the time zone 10827was set to UTC. 10828 10829@example 10830@group 10831#! 
/bin/sh 10832# 10833# date --- approximate the P1003.2 'date' command 10834 10835case $1 in 10836-u) TZ=GMT0 # use UTC 10837 export TZ 10838 shift ;; 10839esac 10840@end group 10841 10842@group 10843gawk 'BEGIN @{ 10844 format = "%a %b %d %H:%M:%S %Z %Y" 10845 exitval = 0 10846@end group 10847 10848@group 10849 if (ARGC > 2) 10850 exitval = 1 10851 else if (ARGC == 2) @{ 10852 format = ARGV[1] 10853 if (format ~ /^\+/) 10854 format = substr(format, 2) # remove leading + 10855 @} 10856 print strftime(format) 10857 exit exitval 10858@}' "$@@" 10859@end group 10860@end example 10861 10862@node User-defined, Invoking Gawk, Built-in, Top 10863@chapter User-defined Functions 10864 10865@cindex user-defined functions 10866@cindex functions, user-defined 10867Complicated @code{awk} programs can often be simplified by defining 10868your own functions. User-defined functions can be called just like 10869built-in ones (@pxref{Function Calls}), but it is up to you to define 10870them---to tell @code{awk} what they should do. 10871 10872@menu 10873* Definition Syntax:: How to write definitions and what they mean. 10874* Function Example:: An example function definition and what it 10875 does. 10876* Function Caveats:: Things to watch out for. 10877* Return Statement:: Specifying the value a function returns. 10878@end menu 10879 10880@node Definition Syntax, Function Example, User-defined, User-defined 10881@section Function Definition Syntax 10882@cindex defining functions 10883@cindex function definition 10884 10885Definitions of functions can appear anywhere between the rules of an 10886@code{awk} program. Thus, the general form of an @code{awk} program is 10887extended to include sequences of rules @emph{and} user-defined function 10888definitions. 10889There is no need in @code{awk} to put the definition of a function 10890before all uses of the function. This is because @code{awk} reads the 10891entire program before starting to execute any of it. 
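
The point about definition order can be checked with a small sketch. The
function name @code{hypot} is a hypothetical helper invented for this
example; the rule that calls it appears before its definition.

```shell
#!/bin/sh
out=$(echo "3 4" | awk '
{ print hypot($1, $2) }      # the call appears before the definition

function hypot(a, b)         # hypothetical helper, defined after its first use
{
    return sqrt(a * a + b * b)
}')
printf '%s\n' "$out"
```

Because the whole program is read before any of it runs, the call works even
though the definition comes later in the program text.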
10892 10893The definition of a function named @var{name} looks like this: 10894 10895@example 10896function @var{name}(@var{parameter-list}) 10897@{ 10898 @var{body-of-function} 10899@} 10900@end example 10901 10902@cindex names, use of 10903@cindex namespaces 10904@noindent 10905@var{name} is the name of the function to be defined. A valid function 10906name is like a valid variable name: a sequence of letters, digits and 10907underscores, not starting with a digit. 10908Within a single @code{awk} program, any particular name can only be 10909used as a variable, array or function. 10910 10911@var{parameter-list} is a list of the function's arguments and local 10912variable names, separated by commas. When the function is called, 10913the argument names are used to hold the argument values given in 10914the call. The local variables are initialized to the empty string. 10915A function cannot have two parameters with the same name. 10916 10917The @var{body-of-function} consists of @code{awk} statements. It is the 10918most important part of the definition, because it says what the function 10919should actually @emph{do}. The argument names exist to give the body a 10920way to talk about the arguments; local variables, to give the body 10921places to keep temporary values. 10922 10923Argument names are not distinguished syntactically from local variable 10924names; instead, the number of arguments supplied when the function is 10925called determines how many argument variables there are. Thus, if three 10926argument values are given, the first three names in @var{parameter-list} 10927are arguments, and the rest are local variables. 10928 10929It follows that if the number of arguments is not the same in all calls 10930to the function, some of the names in @var{parameter-list} may be 10931arguments on some occasions and local variables on others. Another 10932way to think of this is that omitted arguments default to the 10933null string. 
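
The following sketch shows one name playing both roles. The function
@code{join2} and its parameter @code{sep} are hypothetical names chosen for
this example.

```shell
#!/bin/sh
out=$(awk '
# "sep" is an argument when the caller supplies a third value, and an
# uninitialized local variable (the empty string) when the caller omits it.
function join2(a, b, sep)
{
    return a sep b
}

BEGIN {
    print join2("x", "y", "-")   # three arguments: sep holds "-"
    print join2("x", "y")        # two arguments: sep is the null string
}')
printf '%s\n' "$out"
```

The first call prints @samp{x-y}; the second prints @samp{xy}, since the
omitted argument defaults to the null string.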
10934 10935Usually when you write a function you know how many names you intend to 10936use for arguments and how many you intend to use as local variables. It is 10937conventional to place some extra space between the arguments and 10938the local variables, to document how your function is supposed to be used. 10939 10940@cindex variable shadowing 10941During execution of the function body, the arguments and local variable 10942values hide or @dfn{shadow} any variables of the same names used in the 10943rest of the program. The shadowed variables are not accessible in the 10944function definition, because there is no way to name them while their 10945names have been taken away for the local variables. All other variables 10946used in the @code{awk} program can be referenced or set normally in the 10947function's body. 10948 10949The arguments and local variables last only as long as the function body 10950is executing. Once the body finishes, you can once again access the 10951variables that were shadowed while the function was running. 10952 10953@cindex recursive function 10954@cindex function, recursive 10955The function body can contain expressions which call functions. They 10956can even call this function, either directly or by way of another 10957function. When this happens, we say the function is @dfn{recursive}. 10958 10959@cindex @code{awk} language, POSIX version 10960@cindex POSIX @code{awk} 10961In many @code{awk} implementations, including @code{gawk}, 10962the keyword @code{function} may be 10963abbreviated @code{func}. However, POSIX only specifies the use of 10964the keyword @code{function}. This actually has some practical implications. 
10965If @code{gawk} is in POSIX-compatibility mode 10966(@pxref{Options, ,Command Line Options}), then the following 10967statement will @emph{not} define a function: 10968 10969@example 10970func foo() @{ a = sqrt($1) ; print a @} 10971@end example 10972 10973@noindent 10974Instead it defines a rule that, for each record, concatenates the value 10975of the variable @samp{func} with the return value of the function @samp{foo}. 10976If the resulting string is non-null, the action is executed. 10977This is probably not what was desired. (@code{awk} accepts this input as 10978syntactically valid, since functions may be used before they are defined 10979in @code{awk} programs.) 10980 10981@cindex portability issues 10982To ensure that your @code{awk} programs are portable, always use the 10983keyword @code{function} when defining a function. 10984 10985@node Function Example, Function Caveats, Definition Syntax, User-defined 10986@section Function Definition Examples 10987 10988Here is an example of a user-defined function, called @code{myprint}, that 10989takes a number and prints it in a specific format. 10990 10991@example 10992function myprint(num) 10993@{ 10994 printf "%6.3g\n", num 10995@} 10996@end example 10997 10998@noindent 10999To illustrate, here is an @code{awk} rule which uses our @code{myprint} 11000function: 11001 11002@example 11003$3 > 0 @{ myprint($3) @} 11004@end example 11005 11006@noindent 11007This program prints, in our special format, all the third fields that 11008contain a positive number in our input. Therefore, when given: 11009 11010@example 11011@group 11012 1.2 3.4 5.6 7.8 11013 9.10 11.12 -13.14 15.16 1101417.18 19.20 21.22 23.24 11015@end group 11016@end example 11017 11018@noindent 11019this program, using our function to format the results, prints: 11020 11021@example 11022 5.6 11023 21.2 11024@end example 11025 11026This function deletes all the elements in an array. 
11027 11028@example 11029function delarray(a, i) 11030@{ 11031 for (i in a) 11032 delete a[i] 11033@} 11034@end example 11035 11036When working with arrays, it is often necessary to delete all the elements 11037in an array and start over with a new list of elements 11038(@pxref{Delete, ,The @code{delete} Statement}). 11039Instead of having 11040to repeat this loop everywhere in your program that you need to clear out 11041an array, your program can just call @code{delarray}. 11042(This guarantees portability. The usage @samp{delete @var{array}} to delete 11043the contents of an entire array is a non-standard extension.) 11044 11045Here is an example of a recursive function. It takes a string 11046as an input parameter, and returns the string in backwards order. 11047 11048@example 11049function rev(str, start) 11050@{ 11051 if (start == 0) 11052 return "" 11053 11054 return (substr(str, start, 1) rev(str, start - 1)) 11055@} 11056@end example 11057 11058If this function is in a file named @file{rev.awk}, we can test it 11059this way: 11060 11061@example 11062$ echo "Don't Panic!" | 11063> gawk --source '@{ print rev($0, length($0)) @}' -f rev.awk 11064@print{} !cinaP t'noD 11065@end example 11066 11067Here is an example that uses the built-in function @code{strftime}. 11068(@xref{Time Functions, ,Functions for Dealing with Time Stamps}, 11069for more information on @code{strftime}.) 11070The C @code{ctime} function takes a timestamp and returns it in a string, 11071formatted in a well known fashion. 
Here is an @code{awk} version: 11072 11073@example 11074@c file eg/lib/ctime.awk 11075# ctime.awk 11076# 11077# awk version of C ctime(3) function 11078 11079@group 11080function ctime(ts, format) 11081@{ 11082 format = "%a %b %d %H:%M:%S %Z %Y" 11083 if (ts == 0) 11084 ts = systime() # use current time as default 11085 return strftime(format, ts) 11086@} 11087@c endfile 11088@end group 11089@end example 11090 11091@node Function Caveats, Return Statement, Function Example, User-defined 11092@section Calling User-defined Functions 11093 11094@cindex call by value 11095@cindex call by reference 11096@cindex calling a function 11097@cindex function call 11098@dfn{Calling a function} means causing the function to run and do its job. 11099A function call is an expression, and its value is the value returned by 11100the function. 11101 11102A function call consists of the function name followed by the arguments 11103in parentheses. What you write in the call for the arguments are 11104@code{awk} expressions; each time the call is executed, these 11105expressions are evaluated, and the values are the actual arguments. For 11106example, here is a call to @code{foo} with three arguments (the first 11107being a string concatenation): 11108 11109@example 11110foo(x y, "lose", 4 * z) 11111@end example 11112 11113@strong{Caution:} whitespace characters (spaces and tabs) are not allowed 11114between the function name and the open-parenthesis of the argument list. 11115If you write whitespace by mistake, @code{awk} might think that you mean 11116to concatenate a variable with an expression in parentheses. However, it 11117notices that you used a function name and not a variable name, and reports 11118an error. 11119 11120@cindex call by value 11121When a function is called, it is given a @emph{copy} of the values of 11122its arguments. This is known as @dfn{call by value}. 
The caller may use 11123a variable as the expression for the argument, but the called function 11124does not know this: it only knows what value the argument had. For 11125example, if you write this code: 11126 11127@example 11128foo = "bar" 11129z = myfunc(foo) 11130@end example 11131 11132@noindent 11133then you should not think of the argument to @code{myfunc} as being 11134``the variable @code{foo}.'' Instead, think of the argument as the 11135string value, @code{"bar"}. 11136 11137If the function @code{myfunc} alters the values of its local variables, 11138this has no effect on any other variables. Thus, if @code{myfunc} 11139does this: 11140 11141@example 11142@group 11143function myfunc(str) 11144@{ 11145 print str 11146 str = "zzz" 11147 print str 11148@} 11149@end group 11150@end example 11151 11152@noindent 11153to change its first argument variable @code{str}, this @emph{does not} 11154change the value of @code{foo} in the caller. The role of @code{foo} in 11155calling @code{myfunc} ended when its value, @code{"bar"}, was computed. 11156If @code{str} also exists outside of @code{myfunc}, the function body 11157cannot alter this outer value, because it is shadowed during the 11158execution of @code{myfunc} and cannot be seen or changed from there. 11159 11160@cindex call by reference 11161However, when arrays are the parameters to functions, they are @emph{not} 11162copied. Instead, the array itself is made available for direct manipulation 11163by the function. This is usually called @dfn{call by reference}. 11164Changes made to an array parameter inside the body of a function @emph{are} 11165visible outside that function. 11166@ifinfo 11167This can be @strong{very} dangerous if you do not watch what you are 11168doing. 
For example:
@end ifinfo
@iftex
@emph{This can be very dangerous if you do not watch what you are
doing.}  For example:
@end iftex

@example
@group
function changeit(array, ind, nvalue)
@{
     array[ind] = nvalue
@}
@end group

BEGIN @{
    a[1] = 1; a[2] = 2; a[3] = 3
    changeit(a, 2, "two")
    printf "a[1] = %s, a[2] = %s, a[3] = %s\n",
            a[1], a[2], a[3]
@}
@end example

@noindent
This program prints @samp{a[1] = 1, a[2] = two, a[3] = 3}, because
@code{changeit} stores @code{"two"} in the second element of @code{a}.

@cindex undefined functions
@cindex functions, undefined
Some @code{awk} implementations allow you to call a function that
has not been defined, and only report a problem at run-time when the
program actually tries to call the function.  For example:

@example
@group
BEGIN @{
    if (0)
        foo()
    else
        bar()
@}
function bar() @{ @dots{} @}
# note that `foo' is not defined
@end group
@end example

@noindent
Since the condition of the @samp{if} statement is never true, it is not
really a problem that @code{foo} has not been defined.  Usually though, it
is a problem if a program calls an undefined function.

@ignore
At one point, I had gawk dying on this, but later decided that this might
break old programs and/or test suites.
@end ignore

If @samp{--lint} has been specified
(@pxref{Options, ,Command Line Options}),
@code{gawk} will report calls to undefined functions.

Some @code{awk} implementations generate a run-time
error if you use the @code{next} statement
(@pxref{Next Statement, , The @code{next} Statement})
inside a user-defined function.
@code{gawk} does not have this problem.
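
To see the two argument-passing rules from this section side by side, here
is a minimal sketch. The function name @code{modify} and the variable names
are hypothetical, chosen only for this example.

```shell
#!/bin/sh
out=$(awk '
function modify(scalar, arr)
{
    scalar = "changed"      # alters only the local copy of the value
    arr["k"] = "changed"    # alters the array that belongs to the caller
}

BEGIN {
    s = "original"
    a["k"] = "original"
    modify(s, a)
    print s, a["k"]
}')
printf '%s\n' "$out"
```

This prints @samp{original changed}: the scalar is passed by value and keeps
its value in the caller, while the array is passed by reference and shows
the change.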
11233 11234@node Return Statement, , Function Caveats, User-defined 11235@section The @code{return} Statement 11236@cindex @code{return} statement 11237 11238The body of a user-defined function can contain a @code{return} statement. 11239This statement returns control to the rest of the @code{awk} program. It 11240can also be used to return a value for use in the rest of the @code{awk} 11241program. It looks like this: 11242 11243@example 11244return @r{[}@var{expression}@r{]} 11245@end example 11246 11247The @var{expression} part is optional. If it is omitted, then the returned 11248value is undefined and, therefore, unpredictable. 11249 11250A @code{return} statement with no value expression is assumed at the end of 11251every function definition. So if control reaches the end of the function 11252body, then the function returns an unpredictable value. @code{awk} 11253will @emph{not} warn you if you use the return value of such a function. 11254 11255Sometimes, you want to write a function for what it does, not for 11256what it returns. Such a function corresponds to a @code{void} function 11257in C or to a @code{procedure} in Pascal. Thus, it may be appropriate to not 11258return any value; you should simply bear in mind that if you use the return 11259value of such a function, you do so at your own risk. 11260 11261Here is an example of a user-defined function that returns a value 11262for the largest number among the elements of an array: 11263 11264@example 11265@group 11266function maxelt(vec, i, ret) 11267@{ 11268 for (i in vec) @{ 11269 if (ret == "" || vec[i] > ret) 11270 ret = vec[i] 11271 @} 11272 return ret 11273@} 11274@end group 11275@end example 11276 11277@noindent 11278You call @code{maxelt} with one argument, which is an array name. The local 11279variables @code{i} and @code{ret} are not intended to be arguments; 11280while there is nothing to stop you from passing two or three arguments 11281to @code{maxelt}, the results would be strange. 
The extra space before 11282@code{i} in the function parameter list indicates that @code{i} and 11283@code{ret} are not supposed to be arguments. This is a convention that 11284you should follow when you define functions. 11285 11286Here is a program that uses our @code{maxelt} function. It loads an 11287array, calls @code{maxelt}, and then reports the maximum number in that 11288array: 11289 11290@example 11291@group 11292awk ' 11293function maxelt(vec, i, ret) 11294@{ 11295 for (i in vec) @{ 11296 if (ret == "" || vec[i] > ret) 11297 ret = vec[i] 11298 @} 11299 return ret 11300@} 11301@end group 11302 11303@group 11304# Load all fields of each record into nums. 11305@{ 11306 for(i = 1; i <= NF; i++) 11307 nums[NR, i] = $i 11308@} 11309 11310END @{ 11311 print maxelt(nums) 11312@}' 11313@end group 11314@end example 11315 11316Given the following input: 11317 11318@example 11319@group 11320 1 5 23 8 16 1132144 3 5 2 8 26 11322256 291 1396 2962 100 11323-6 467 998 1101 1132499385 11 0 225 11325@end group 11326@end example 11327 11328@noindent 11329our program tells us (predictably) that @code{99385} is the largest number 11330in our array. 11331 11332@node Invoking Gawk, Library Functions, User-defined, Top 11333@chapter Running @code{awk} 11334@cindex command line 11335@cindex invocation of @code{gawk} 11336@cindex arguments, command line 11337@cindex options, command line 11338@cindex long options 11339@cindex options, long 11340 11341There are two ways to run @code{awk}: with an explicit program, or with 11342one or more program files. Here are templates for both of them; items 11343enclosed in @samp{@r{[}@dots{}@r{]}} in these templates are optional. 11344 11345Besides traditional one-letter POSIX-style options, @code{gawk} also 11346supports GNU long options. 

@example
awk @r{[@var{options}]} -f progfile @r{[@code{--}]} @var{file} @dots{}
awk @r{[@var{options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{}
@end example

@cindex empty program
@cindex dark corner
It is possible to invoke @code{awk} with an empty program:

@example
$ awk '' datafile1 datafile2
@end example

@noindent
Doing so makes little sense though; @code{awk} will simply exit
silently when given an empty program (d.c.).  If @samp{--lint} has
been specified on the command line, @code{gawk} will issue a
warning that the program is empty.

@menu
* Options::                     Command line options and their meanings.
* Other Arguments::             Input file names and variable assignments.
* AWKPATH Variable::            Searching directories for @code{awk} programs.
* Obsolete::                    Obsolete Options and/or features.
* Undocumented::                Undocumented Options and Features.
* Known Bugs::                  Known Bugs in @code{gawk}.
@end menu

@node Options, Other Arguments, Invoking Gawk, Invoking Gawk
@section Command Line Options

Options begin with a dash, and consist of a single character.
GNU style long options consist of two dashes and a keyword.
The keyword can be abbreviated, as long as the abbreviation allows the option
to be uniquely identified.  If the option takes an argument, then the
keyword is either immediately followed by an equals sign (@samp{=}) and the
argument's value, or the keyword and the argument's value are separated
by whitespace.  For brevity, the discussion below only refers to the
traditional short options; however the long and short options are
interchangeable in all contexts.

Each long option for @code{gawk} has a corresponding
POSIX-style option.
The options and their meanings are as follows:

@table @code
@item -F @var{fs}
@itemx --field-separator @var{fs}
@cindex @code{-F} option
@cindex @code{--field-separator} option
Sets the @code{FS} variable to @var{fs}
(@pxref{Field Separators, ,Specifying How Fields are Separated}).

@item -f @var{source-file}
@itemx --file @var{source-file}
@cindex @code{-f} option
@cindex @code{--file} option
Indicates that the @code{awk} program is to be found in @var{source-file}
instead of in the first non-option argument.

@item -v @var{var}=@var{val}
@itemx --assign @var{var}=@var{val}
@cindex @code{-v} option
@cindex @code{--assign} option
Sets the variable @var{var} to the value @var{val} @strong{before}
execution of the program begins.  Such variable values are available
inside the @code{BEGIN} rule
(@pxref{Other Arguments, ,Other Command Line Arguments}).

The @samp{-v} option can only set one variable, but you can use
it more than once, setting another variable each time, like this:
@samp{awk @w{-v foo=1} @w{-v bar=2} @dots{}}.

@strong{Caution:}  Using @samp{-v} to set the values of the built-in
variables may lead to surprising results.  @code{awk} will reset the
values of those variables as it needs to, possibly ignoring any
predefined value you may have given.

@item -mf @var{NNN}
@itemx -mr @var{NNN}
Set various memory limits to the value @var{NNN}.  The @samp{f} flag sets
the maximum number of fields, and the @samp{r} flag sets the maximum
record size.  These two flags and the @samp{-m} option are from the
Bell Labs research version of Unix @code{awk}.  They are provided
for compatibility, but otherwise ignored by
@code{gawk}, since @code{gawk} has no predefined limits.
11433 11434@item -W @var{gawk-opt} 11435@cindex @code{-W} option 11436Following the POSIX standard, options that are implementation 11437specific are supplied as arguments to the @samp{-W} option. These options 11438also have corresponding GNU style long options. 11439See below. 11440 11441@item -- 11442Signals the end of the command line options. The following arguments 11443are not treated as options even if they begin with @samp{-}. This 11444interpretation of @samp{--} follows the POSIX argument parsing 11445conventions. 11446 11447This is useful if you have file names that start with @samp{-}, 11448or in shell scripts, if you have file names that will be specified 11449by the user which could start with @samp{-}. 11450@end table 11451 11452The following @code{gawk}-specific options are available: 11453 11454@table @code 11455@item -W traditional 11456@itemx -W compat 11457@itemx --traditional 11458@itemx --compat 11459@cindex @code{--compat} option 11460@cindex @code{--traditional} option 11461@cindex compatibility mode 11462Specifies @dfn{compatibility mode}, in which the GNU extensions to 11463the @code{awk} language are disabled, so that @code{gawk} behaves just 11464like the Bell Labs research version of Unix @code{awk}. 11465@samp{--traditional} is the preferred form of this option. 11466@xref{POSIX/GNU, ,Extensions in @code{gawk} Not in POSIX @code{awk}}, 11467which summarizes the extensions. Also see 11468@ref{Compatibility Mode, ,Downward Compatibility and Debugging}. 11469 11470@item -W copyleft 11471@itemx -W copyright 11472@itemx --copyleft 11473@itemx --copyright 11474@cindex @code{--copyleft} option 11475@cindex @code{--copyright} option 11476Print the short version of the General Public License, and then exit. 11477This option may disappear in a future version of @code{gawk}. 

@item -W help
@itemx -W usage
@itemx --help
@itemx --usage
@cindex @code{--help} option
@cindex @code{--usage} option
Print a ``usage'' message summarizing the short and long style options
that @code{gawk} accepts, and then exit.

@item -W lint
@itemx --lint
@cindex @code{--lint} option
Warn about constructs that are dubious or non-portable to
other @code{awk} implementations.
Some warnings are issued when @code{gawk} first reads your program. Others
are issued at run-time, as your program executes.

@item -W lint-old
@itemx --lint-old
@cindex @code{--lint-old} option
Warn about constructs that are not available in
the original Version 7 Unix version of @code{awk}
(@pxref{V7/SVR3.1, , Major Changes between V7 and SVR3.1}).

@item -W posix
@itemx --posix
@cindex @code{--posix} option
@cindex POSIX mode
Operate in strict POSIX mode. This disables all @code{gawk}
extensions (just like @samp{--traditional}), and adds the following additional
restrictions:

@c IMPORTANT! Keep this list in sync with the one in node POSIX

@itemize @bullet
@item
@code{\x} escape sequences are not recognized
(@pxref{Escape Sequences}).

@item
Newlines do not act as whitespace to separate fields when @code{FS} is
equal to a single space.

@item
The synonym @code{func} for the keyword @code{function} is not
recognized (@pxref{Definition Syntax, ,Function Definition Syntax}).

@item
The operators @samp{**} and @samp{**=} cannot be used in
place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops, ,Arithmetic Operators},
and also @pxref{Assignment Ops, ,Assignment Expressions}).

@item
Specifying @samp{-Ft} on the command line does not set the value
of @code{FS} to be a single tab character
(@pxref{Field Separators, ,Specifying How Fields are Separated}).

@item
The @code{fflush} built-in function is not supported
(@pxref{I/O Functions, , Built-in Functions for Input/Output}).
@end itemize

If you supply both @samp{--traditional} and @samp{--posix} on the
command line, @samp{--posix} will take precedence. @code{gawk}
will also issue a warning if both options are supplied.

@item -W re-interval
@itemx --re-interval
Allow interval expressions
(@pxref{Regexp Operators, , Regular Expression Operators}),
in regexps.
Because interval expressions were traditionally not available in @code{awk},
@code{gawk} does not provide them by default. This prevents old @code{awk}
programs from breaking.

@item -W source @var{program-text}
@itemx --source @var{program-text}
@cindex @code{--source} option
Program source code is taken from the @var{program-text}. This option
allows you to mix source code in files with source
code that you enter on the command line. This is particularly useful
when you have library functions that you wish to use from your command line
programs (@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).

@item -W version
@itemx --version
@cindex @code{--version} option
Prints version information for this particular copy of @code{gawk}.
This allows you to determine if your copy of @code{gawk} is up to date
with respect to whatever the Free Software Foundation is currently
distributing.
It is also useful for bug reports
(@pxref{Bugs, , Reporting Problems and Bugs}).
@end table

Any other options are flagged as invalid with a warning message, but
are otherwise ignored.
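
As a hedged illustration of the @samp{-v} and @samp{-F} entries in the
first table above, the following shell sketch uses only options that
POSIX @code{awk} also accepts. The variable names @code{foo}, @code{bar},
@code{sum}, and @code{field}, and the sample input line, are made up for
this example.

```shell
# Each -v sets exactly one variable, and the value is already
# visible inside the BEGIN rule.
sum=$(awk -v foo=1 -v bar=2 'BEGIN { print foo + bar }')
echo "$sum"

# -F sets FS, here to a colon, before any input is read.
field=$(printf 'a:b:c\n' | awk -F: '{ print $2 }')
echo "$field"
```

The first command needs no input file because the program consists
only of a @code{BEGIN} rule.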

In compatibility mode, as a special case, if the value of @var{fs} supplied
to the @samp{-F} option is @samp{t}, then @code{FS} is set to the tab
character (@code{"\t"}). This is only true for @samp{--traditional}, and not
for @samp{--posix}
(@pxref{Field Separators, ,Specifying How Fields are Separated}).

The @samp{-f} option may be used more than once on the command line.
If it is, @code{awk} reads its program source from all of the named files, as
if they had been concatenated together into one big file. This is
useful for creating libraries of @code{awk} functions. Useful functions
can be written once, and then retrieved from a standard place, instead
of having to be included into each individual program.

You can type in a program at the terminal and still use library functions,
by specifying @samp{-f /dev/tty}. @code{awk} will read a file from the terminal
to use as part of the @code{awk} program. After typing your program,
type @kbd{Control-d} (the end-of-file character) to terminate it.
(You may also use @samp{-f -} to read program source from the standard
input, but then you will not be able to also use the standard input as a
source of data.)

Because it is clumsy using the standard @code{awk} mechanisms to mix source
file and command line @code{awk} programs, @code{gawk} provides the
@samp{--source} option. This does not require you to pre-empt the standard
input for your source code, and allows you to easily mix command line
and library source code
(@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).

If no @samp{-f} or @samp{--source} option is specified, then @code{gawk}
will use the first non-option command line argument as the text of the
program source code.
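
A minimal sketch of the library idea described above: two @samp{-f}
options are read as if the files had been concatenated. The scratch
directory, the file names @file{lib.awk} and @file{main.awk}, and the
function @code{twice} are hypothetical, and @code{mktemp -d} is assumed
to be available on your system.

```shell
# Hypothetical scratch directory holding a "library" and a main program.
dir=$(mktemp -d)

cat > "$dir/lib.awk" <<'EOF'
function twice(x) { return 2 * x }
EOF

cat > "$dir/main.awk" <<'EOF'
{ print twice($1) }
EOF

# Both files together form one program; the data comes from standard input.
result=$(printf '21\n' | awk -f "$dir/lib.awk" -f "$dir/main.awk")
echo "$result"

rm -r "$dir"
```

With @code{gawk}, the one-line main program could instead be given on
the command line with @samp{--source}, so that only the library lives
in a file.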

@cindex @code{POSIXLY_CORRECT} environment variable
@cindex environment variable, @code{POSIXLY_CORRECT}
If the environment variable @code{POSIXLY_CORRECT} exists,
then @code{gawk} will behave in strict POSIX mode, exactly as if
you had supplied the @samp{--posix} command line option.
Many GNU programs look for this environment variable to turn on
strict POSIX mode. If you supply @samp{--lint} on the command line,
and @code{gawk} turns on POSIX mode because of @code{POSIXLY_CORRECT},
then it will print a warning message indicating that POSIX
mode is in effect.

You would typically set this variable in your shell's startup file.
For a Bourne compatible shell (such as Bash), you would add these
lines to the @file{.profile} file in your home directory.

@example
@group
POSIXLY_CORRECT=true
export POSIXLY_CORRECT
@end group
@end example

For a @code{csh} compatible shell,@footnote{Not recommended.}
you would add this line to the @file{.login} file in your home directory.

@example
setenv POSIXLY_CORRECT true
@end example

@node Other Arguments, AWKPATH Variable, Options, Invoking Gawk
@section Other Command Line Arguments

Any additional arguments on the command line are normally treated as
input files to be processed in the order specified. However, an
argument that has the form @code{@var{var}=@var{value}}, assigns
the value @var{value} to the variable @var{var}---it does not specify a
file at all.

@vindex ARGIND
@vindex ARGV
All these arguments are made available to your @code{awk} program in the
@code{ARGV} array (@pxref{Built-in Variables}). Command line options
and the program text (if present) are omitted from @code{ARGV}.
All other arguments, including variable assignments, are
included.
As each element of @code{ARGV} is processed, @code{gawk}
sets the variable @code{ARGIND} to the index in @code{ARGV} of the
current element.

The distinction between file name arguments and variable-assignment
arguments is made when @code{awk} is about to open the next input file.
At that point in execution, it checks the ``file name'' to see whether
it is really a variable assignment; if so, @code{awk} sets the variable
instead of reading a file.

Therefore, the variables actually receive the given values after all
previously specified files have been read. In particular, the values of
variables assigned in this fashion are @emph{not} available inside a
@code{BEGIN} rule
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}),
since such rules are run before @code{awk} begins scanning the argument list.

@cindex dark corner
The variable values given on the command line are processed for escape
sequences (d.c.) (@pxref{Escape Sequences}).

In some earlier implementations of @code{awk}, when a variable assignment
occurred before any file names, the assignment would happen @emph{before}
the @code{BEGIN} rule was executed. @code{awk}'s behavior was thus
inconsistent; some command line assignments were available inside the
@code{BEGIN} rule, while others were not. However,
some applications came to depend
upon this ``feature.'' When @code{awk} was changed to be more consistent,
the @samp{-v} option was added to accommodate applications that depended
upon the old behavior.

The variable assignment feature is most useful for assigning to variables
such as @code{RS}, @code{OFS}, and @code{ORS}, which control input and
output formats, before scanning the data files. It is also useful for
controlling state if multiple passes are needed over a data file.
For example:

@cindex multiple passes over data
@cindex passes, multiple
@example
awk 'pass == 1  @{ @var{pass 1 stuff} @}
     pass == 2  @{ @var{pass 2 stuff} @}' pass=1 mydata pass=2 mydata
@end example

Given the variable assignment feature, the @samp{-F} option for setting
the value of @code{FS} is not
strictly necessary. It remains for historical compatibility.

@node AWKPATH Variable, Obsolete, Other Arguments, Invoking Gawk
@section The @code{AWKPATH} Environment Variable
@cindex @code{AWKPATH} environment variable
@cindex environment variable, @code{AWKPATH}
@cindex search path
@cindex directory search
@cindex path, search
@cindex differences between @code{gawk} and @code{awk}

The previous section described how @code{awk} program files can be named
on the command line with the @samp{-f} option. In most @code{awk}
implementations, you must supply a precise path name for each program
file, unless the file is in the current directory.

@cindex search path, for source files
But in @code{gawk}, if the file name supplied to the @samp{-f} option
does not contain a @samp{/}, then @code{gawk} searches a list of
directories (called the @dfn{search path}), one by one, looking for a
file with the specified name.

The search path is a string consisting of directory names
separated by colons. @code{gawk} gets its search path from the
@code{AWKPATH} environment variable. If that variable does not exist,
@code{gawk} uses a default path, which is
@samp{.:/usr/local/share/awk}.@footnote{Your version of @code{gawk}
may use a different directory; it
will depend upon how @code{gawk} was built and installed. The actual
directory will be the value of @samp{$(datadir)} generated when
@code{gawk} was configured.
You probably don't need to worry about this
though.} (Programs written for use by
system administrators should use an @code{AWKPATH} variable that
does not include the current directory, @file{.}.)

The search path feature is particularly useful for building up libraries
of useful @code{awk} functions. The library files can be placed in a
standard directory that is in the default path, and then specified on
the command line with a short file name. Otherwise, the full file name
would have to be typed for each file.

By using both the @samp{--source} and @samp{-f} options, your command line
@code{awk} programs can use facilities in @code{awk} library files.
@xref{Library Functions, , A Library of @code{awk} Functions}.

Path searching is not done if @code{gawk} is in compatibility mode.
This is true for both @samp{--traditional} and @samp{--posix}.
@xref{Options, ,Command Line Options}.

@strong{Note:} if you want files in the current directory to be found,
you must include the current directory in the path, either by including
@file{.} explicitly in the path, or by writing a null entry in the
path. (A null entry is indicated by starting or ending the path with a
colon, or by placing two colons next to each other (@samp{::}).) If the
current directory is not included in the path, then files cannot be
found in the current directory. This path search mechanism is identical
to the shell's.
@c someday, @cite{The Bourne Again Shell}....

Starting with version 3.0, if @code{AWKPATH} is not defined in the
environment, @code{gawk} will place its default search path into
@code{ENVIRON["AWKPATH"]}. This makes it easy to determine
the actual search path @code{gawk} will use.
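
The following sketch, which assumes that @code{gawk} is installed under
that name, shows the search path at work. The scratch directory, the
library file @file{lib.awk}, and the function @code{twice} are
hypothetical, and @code{mktemp -d} is assumed to be available.

```shell
if command -v gawk >/dev/null 2>&1; then
    # Hypothetical library directory containing one function.
    dir=$(mktemp -d)
    printf 'function twice(x) { return 2 * x }\n' > "$dir/lib.awk"

    # "lib.awk" contains no "/", so gawk looks for it along AWKPATH.
    result=$(AWKPATH="$dir" gawk -f lib.awk --source 'BEGIN { print twice(21) }')
    rm -r "$dir"
else
    result="gawk not found"
fi
echo "$result"
```

To see which path is in effect on your own system, you could run
@samp{gawk 'BEGIN @{ print ENVIRON["AWKPATH"] @}'}; the value printed
depends on how @code{gawk} was built and on your environment.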

@node Obsolete, Undocumented, AWKPATH Variable, Invoking Gawk
@section Obsolete Options and/or Features

@cindex deprecated options
@cindex obsolete options
@cindex deprecated features
@cindex obsolete features
This section describes features and/or command line options from
previous releases of @code{gawk} that are either not available in the
current version, or that are still supported but deprecated (meaning that
they will @emph{not} be in the next release).

@c update this section for each release!

For version @value{VERSION}.@value{PATCHLEVEL} of @code{gawk}, there are no
command line options
or other deprecated features from the previous version of @code{gawk}.
@iftex
This section
@end iftex
@ifinfo
This node
@end ifinfo
is thus essentially a place holder,
in case some option becomes obsolete in a future version of @code{gawk}.

@ignore
@c This is pretty old news...
The public-domain version of @code{strftime} that is distributed with
@code{gawk} changed for the 2.14 release. The @samp{%V} conversion specifier
that used to generate the date in VMS format was changed to @samp{%v}.
This is because the POSIX standard for the @code{date} utility now
specifies a @samp{%V} conversion specifier.
@xref{Time Functions, ,Functions for Dealing with Time Stamps}, for details.
@end ignore

@node Undocumented, Known Bugs, Obsolete, Invoking Gawk
@section Undocumented Options and Features
@cindex undocumented features
@display
@i{Use the Source, Luke!}
Obi-Wan
@end display
@sp 1

This section intentionally left blank.

@c Read The Source, Luke!

@ignore
@c If these came out in the Info file or TeX document, then they wouldn't
@c be undocumented, would they?

@code{gawk} has one undocumented option:

@table @code
@item -W nostalgia
@itemx --nostalgia
Print the message @code{"awk: bailing out near line 1"} and dump core.
This option was inspired by the common behavior of very early versions of
Unix @code{awk}, and by a t-shirt.
@end table

Early versions of @code{awk} used to not require any separator (either
a newline or @samp{;}) between the rules in @code{awk} programs. Thus,
it was common to see one-line programs like:

@example
awk '@{ sum += $1 @} END @{ print sum @}'
@end example

@code{gawk} actually supports this, but it is purposely undocumented
since it is considered bad style. The correct way to write such a program
is either

@example
awk '@{ sum += $1 @} ; END @{ print sum @}'
@end example

@noindent
or

@example
awk '@{ sum += $1 @}
     END @{ print sum @}' data
@end example

@noindent
@xref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a fuller
explanation.

@end ignore

@node Known Bugs, , Undocumented, Invoking Gawk
@section Known Bugs in @code{gawk}
@cindex bugs, known in @code{gawk}
@cindex known bugs

@itemize @bullet
@item
The @samp{-F} option for changing the value of @code{FS}
(@pxref{Options, ,Command Line Options})
is not necessary given the command line variable
assignment feature; it remains only for backwards compatibility.

@item
If your system actually has support for @file{/dev/fd} and the
associated @file{/dev/stdin}, @file{/dev/stdout}, and
@file{/dev/stderr} files, you may get different output from @code{gawk}
than you would get on a system without those files.
When @code{gawk}
interprets these files internally, it synchronizes output to the
standard output with output to @file{/dev/stdout}, while on a system
with those files, the output is actually to different open files
(@pxref{Special Files, ,Special File Names in @code{gawk}}).

@item
Syntactically invalid single character programs tend to overflow
the parse stack, generating a rather unhelpful message. Such programs
are surprisingly difficult to diagnose in the completely general case,
and the effort to do so really is not worth it.
@end itemize

@node Library Functions, Sample Programs, Invoking Gawk, Top
@chapter A Library of @code{awk} Functions

@c 2e: USE TEXINFO-2 FUNCTION DEFINITION STUFF!!!!!!!!!!!!!
This chapter presents a library of useful @code{awk} functions. The
sample programs presented later
(@pxref{Sample Programs, ,Practical @code{awk} Programs})
use these functions.
The functions are presented here in a progression from simple to complex.

@ref{Extract Program, ,Extracting Programs from Texinfo Source Files},
presents a program that you can use to extract the source code for
these example library functions and programs from the Texinfo source
for this @value{DOCUMENT}.
(This has already been done as part of the @code{gawk} distribution.)

If you have written one or more useful, general purpose @code{awk} functions,
and would like to contribute them for a subsequent edition of this @value{DOCUMENT},
please contact the author. @xref{Bugs, ,Reporting Problems and Bugs},
for information on doing this. Don't just send code, as you will be
required to either place your code in the public domain,
publish it under the GPL (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}),
or assign the copyright in it to the Free Software Foundation.

@menu
* Portability Notes::           What to do if you don't have @code{gawk}.
* Nextfile Function::           Two implementations of a @code{nextfile}
                                function.
* Assert Function::             A function for assertions in @code{awk}
                                programs.
* Round Function::              A function for rounding if @code{sprintf} does
                                not do it correctly.
* Ordinal Functions::           Functions for using characters as numbers and
                                vice versa.
* Join Function::               A function to join an array into a string.
* Mktime Function::             A function to turn a date into a timestamp.
* Gettimeofday Function::       A function to get formatted times.
* Filetrans Function::          A function for handling data file transitions.
* Getopt Function::             A function for processing command line
                                arguments.
* Passwd Functions::            Functions for getting user information.
* Group Functions::             Functions for getting group information.
* Library Names::               How to best name private global variables in
                                library functions.
@end menu

@node Portability Notes, Nextfile Function, Library Functions, Library Functions
@section Simulating @code{gawk}-specific Features
@cindex portability issues

The programs in this chapter and in
@ref{Sample Programs, ,Practical @code{awk} Programs},
freely use features that are specific to @code{gawk}.
This section briefly discusses how you can rewrite these programs for
different implementations of @code{awk}.

Diagnostic error messages are sent to @file{/dev/stderr}.
Use @samp{| "cat 1>&2"} instead of @samp{> "/dev/stderr"}, if your system
does not have a @file{/dev/stderr}, or if you cannot use @code{gawk}.

A number of programs use @code{nextfile}
(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}),
to skip any remaining input in the input file.
@ref{Nextfile Function, ,Implementing @code{nextfile} as a Function},
shows you how to write a function that will do the same thing.

Finally, some of the programs choose to ignore upper-case and lower-case
distinctions in their input. They do this by assigning one to @code{IGNORECASE}.
You can achieve the same effect by adding the following rule to the
beginning of the program:

@example
# ignore case
@{ $0 = tolower($0) @}
@end example

@noindent
Also, verify that all regexp and string constants used in
comparisons only use lower-case letters.

@node Nextfile Function, Assert Function, Portability Notes, Library Functions
@section Implementing @code{nextfile} as a Function

@cindex skipping input files
@cindex input files, skipping
The @code{nextfile} statement presented in
@ref{Nextfile Statement, ,The @code{nextfile} Statement},
is a @code{gawk}-specific extension. It is not available in other
implementations of @code{awk}. This section shows two versions of a
@code{nextfile} function that you can use to simulate @code{gawk}'s
@code{nextfile} statement if you cannot use @code{gawk}.

Here is a first attempt at writing a @code{nextfile} function.

@example
@group
# nextfile --- skip remaining records in current file

# this should be read in before the "main" awk program

function nextfile()    @{ _abandon_ = FILENAME; next @}

_abandon_ == FILENAME  @{ next @}
@end group
@end example

This file should be included before the main program, because it supplies
a rule that must be executed first. This rule compares the current data
file's name (which is always in the @code{FILENAME} variable) to a private
variable named @code{_abandon_}.
If the file name matches, then the action
part of the rule executes a @code{next} statement, to go on to the next
record. (The use of @samp{_} in the variable name is a convention.
It is discussed more fully in
@ref{Library Names, , Naming Library Function Global Variables}.)

The use of the @code{next} statement effectively creates a loop that reads
all the records from the current data file.
Eventually, the end of the file is reached, and
a new data file is opened, changing the value of @code{FILENAME}.
Once this happens, the comparison of @code{_abandon_} to @code{FILENAME}
fails, and execution continues with the first rule of the ``real'' program.

The @code{nextfile} function itself simply sets the value of @code{_abandon_}
and then executes a @code{next} statement to start the loop
going.@footnote{Some implementations of @code{awk} do not allow you to
execute @code{next} from within a function body. Some other work-around
will be necessary if you use such a version.}
@c mawk is what we're talking about.

This initial version has a subtle problem. What happens if the same data
file is listed @emph{twice} on the command line, one right after the other,
or even with just a variable assignment between the two occurrences of
the file name?

@c @findex nextfile
@c do it this way, since all the indices are merged
@cindex @code{nextfile} function
In such a case,
this code will skip right through the file, a second time, even though
it should stop when it gets to the end of the first occurrence.
Here is a second version of @code{nextfile} that remedies this problem.

@example
@c file eg/lib/nextfile.awk
# nextfile --- skip remaining records in current file
# correctly handle successive occurrences of the same file
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May, 1993

# this should be read in before the "main" awk program

function nextfile()   @{ _abandon_ = FILENAME; next @}

@group
_abandon_ == FILENAME @{
      if (FNR == 1)
          _abandon_ = ""
      else
          next
@}
@end group
@c endfile
@end example

The @code{nextfile} function has not changed. It sets @code{_abandon_}
equal to the current file name and then executes a @code{next} statement.
The @code{next} statement reads the next record and increments @code{FNR},
so @code{FNR} is guaranteed to have a value of at least two.
However, if @code{nextfile} is called for the last record in the file,
then @code{awk} will close the current data file and move on to the next
one. Upon doing so, @code{FILENAME} will be set to the name of the new file,
and @code{FNR} will be reset to one. If this next file is the same as
the previous one, @code{_abandon_} will still be equal to @code{FILENAME}.
However, @code{FNR} will be equal to one, telling us that this is a new
occurrence of the file, and not the one we were reading when the
@code{nextfile} function was executed. In that case, @code{_abandon_}
is reset to the empty string, so that further executions of this rule
will fail (until the next time that @code{nextfile} is called).

If @code{FNR} is not one, then we are still in the original data file,
and the program executes a @code{next} statement to skip through it.

An important question to ask at this point is: ``Given that the
functionality of @code{nextfile} can be provided with a library file,
why is it built into @code{gawk}?'' This is an important question.
Adding
features for little reason leads to larger, slower programs that are
harder to maintain.

The answer is that building @code{nextfile} into @code{gawk} provides
significant gains in efficiency. If the @code{nextfile} function is executed
at the beginning of a large data file, @code{awk} still has to scan the entire
file, splitting it up into records, just to skip over it. The built-in
@code{nextfile} can simply close the file immediately and proceed to the
next one, saving a lot of time. This is particularly important in
@code{awk}, since @code{awk} programs are generally I/O bound (i.e.@:
they spend most of their time doing input and output, instead of performing
computations).

@node Assert Function, Round Function, Nextfile Function, Library Functions
@section Assertions

@cindex assertions
@cindex @code{assert}, C version
When writing large programs, it is often useful to be able to know
that a condition or set of conditions is true. Before proceeding with a
particular computation, you make a statement about what you believe to be
the case. Such a statement is known as an
``assertion.'' The C language provides an @code{<assert.h>} header file
and corresponding @code{assert} macro that the programmer can use to make
assertions. If an assertion fails, the @code{assert} macro arranges to
print a diagnostic message describing the condition that should have
been true but was not, and then it kills the program.
In C, using
@code{assert} looks like this:

@c NEEDED
@page
@example
#include <assert.h>

int myfunc(int a, double b)
@{
     assert(a <= 5 && b >= 17);
     @dots{}
@}
@end example

If the assertion failed, the program would print a message similar to
this:

@example
prog.c:5: assertion failed: a <= 5 && b >= 17
@end example

@findex assert
The ANSI C language makes it possible to turn the condition into a string for use
in printing the diagnostic message. This is not possible in @code{awk}, so
this @code{assert} function also requires a string version of the condition
that is being tested.

@example
@c @group
@c file eg/lib/assert.awk
# assert --- assert that a condition is true. Otherwise exit.
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May, 1993

function assert(condition, string)
@{
    if (! condition) @{
        printf("%s:%d: assertion failed: %s\n",
            FILENAME, FNR, string) > "/dev/stderr"
        _assert_exit = 1
        exit 1
    @}
@}

END @{
    if (_assert_exit)
        exit 1
@}
@c endfile
@c @end group
@end example

The @code{assert} function tests the @code{condition} parameter. If it
is false, it prints a message to standard error, using the @code{string}
parameter to describe the failed condition. It then sets the variable
@code{_assert_exit} to one, and executes the @code{exit} statement.
The @code{exit} statement jumps to the @code{END} rule. If the @code{END}
rule finds @code{_assert_exit} to be true, then it exits immediately.

The purpose of the @code{END} rule with its test is to
keep any other @code{END} rules from running. When an assertion fails, the
program should exit immediately.
If no assertions fail, then @code{_assert_exit} will still be
false when the @code{END} rule is run normally, and the rest of the
program's @code{END} rules will execute.
For all of this to work correctly, @file{assert.awk} must be the
first source file read by @code{awk}.

@c NEEDED
@page
You would use this function in your programs this way:

@example
function myfunc(a, b)
@{
     assert(a <= 5 && b >= 17, "a <= 5 && b >= 17")
     @dots{}
@}
@end example

@noindent
If the assertion failed, you would see a message like this:

@example
mydata:1357: assertion failed: a <= 5 && b >= 17
@end example

There is a problem with this version of @code{assert}, that it may not
be possible to work around with standard @code{awk}.
An @code{END} rule is automatically added
to the program calling @code{assert}. Normally, if a program consists
of just a @code{BEGIN} rule, the input files and/or standard input are
not read. However, now that the program has an @code{END} rule, @code{awk}
will attempt to read the input data files, or standard input
(@pxref{Using BEGIN/END, , Startup and Cleanup Actions}),
most likely causing the program to hang, waiting for input.

@node Round Function, Ordinal Functions, Assert Function, Library Functions
@section Rounding Numbers

@cindex rounding
The way @code{printf} and @code{sprintf}
(@pxref{Printf, , Using @code{printf} Statements for Fancier Printing})
do rounding will often depend
upon the system's C @code{sprintf} subroutine.
On many machines,
@code{sprintf} rounding is ``unbiased,'' which means it doesn't always
round a trailing @samp{.5} up, contrary to naive expectations. In unbiased
rounding, @samp{.5} rounds to even, rather than always up, so 1.5 rounds to
2 but 4.5 rounds to 4.
The result is that if you are using a format that does
rounding (e.g., @code{"%.0f"}) you should check what your system does.
The following function does traditional rounding;
it might be useful if your @code{awk}'s @code{printf} does unbiased rounding.

@findex round
@example
@c file eg/lib/round.awk
# round --- do normal rounding
#
# Arnold Robbins, arnold@@gnu.org, August, 1996
# Public Domain

function round(x,   ival, aval, fraction)
@{
    ival = int(x)    # integer part, int() truncates

    # see if fractional part
    if (ival == x)   # no fraction
        return x

    if (x < 0) @{
        aval = -x     # absolute value
        ival = int(aval)
        fraction = aval - ival
@group
        if (fraction >= .5)
            return int(x) - 1   # -2.5 --> -3
        else
            return int(x)       # -2.3 --> -2
@end group
    @} else @{
        fraction = x - ival
        if (fraction >= .5)
            return ival + 1
        else
            return ival
    @}
@}

# test harness
@{ print $0, round($0) @}
@c endfile
@end example

@node Ordinal Functions, Join Function, Round Function, Library Functions
@section Translating Between Characters and Numbers

@cindex numeric character values
@cindex values of characters as numbers
One commercial implementation of @code{awk} supplies a built-in function,
@code{ord}, which takes a character and returns the numeric value for that
character in the machine's character set.  If the string passed to
@code{ord} has more than one character, only the first one is used.

The inverse of this function is @code{chr} (from the function of the same
name in Pascal), which takes a number and returns the corresponding character.

Both functions can be written very nicely in @code{awk}; there is no real
reason to build them into the @code{awk} interpreter.

@findex ord
@findex chr
@example
@group
@c file eg/lib/ord.awk
# ord.awk --- do ord and chr
#
# Global identifiers:
#    _ord_:        numerical values indexed by characters
#    _ord_init:    function to initialize _ord_
#
# Arnold Robbins
# arnold@@gnu.org
# Public Domain
# 16 January, 1992
# 20 July, 1992, revised

BEGIN    @{ _ord_init() @}
@c endfile
@end group

@c @group
@c file eg/lib/ord.awk
function _ord_init(    low, high, i, t)
@{
    low = sprintf("%c", 7) # BEL is ascii 7
    if (low == "\a") @{    # regular ascii
        low = 0
        high = 127
    @} else if (sprintf("%c", 128 + 7) == "\a") @{
        # ascii, mark parity
        low = 128
        high = 255
    @} else @{        # ebcdic(!)
        low = 0
        high = 255
    @}

    for (i = low; i <= high; i++) @{
        t = sprintf("%c", i)
        _ord_[t] = i
    @}
@}
@c endfile
@c @end group
@end example

@cindex character sets
@cindex character encodings
@cindex ASCII
@cindex EBCDIC
@cindex mark parity
Some explanation of the numbers used by @code{chr} is worthwhile.
The most prominent character set in use today is ASCII. Although an
eight-bit byte can hold 256 distinct values (from zero to 255), ASCII only
defines characters that use the values from zero to 127.@footnote{ASCII
has been extended in many countries to use the values from 128 to 255
for country-specific characters.  If your system uses these extensions,
you can simplify @code{_ord_init} to simply loop from zero to 255.}
At least one computer manufacturer that we know of
@c Pr1me, blech
uses ASCII, but with mark parity, meaning that the leftmost bit in the byte
is always one.  What this means is that on those systems, characters
have numeric values from 128 to 255.
Finally, large mainframe systems use the EBCDIC character set, which
uses all 256 values.
While there are other character sets in use on some older systems,
they are not really worth worrying about.

@example
@group
@c file eg/lib/ord.awk
function ord(str,    c)
@{
    # only first character is of interest
    c = substr(str, 1, 1)
    return _ord_[c]
@}
@c endfile
@end group

@group
@c file eg/lib/ord.awk
function chr(c)
@{
    # force c to be numeric by adding 0
    return sprintf("%c", c + 0)
@}
@c endfile
@end group

@group
@c file eg/lib/ord.awk
#### test code ####
# BEGIN    \
# @{
#    for (;;) @{
#        printf("enter a character: ")
#        if (getline var <= 0)
#            break
#        printf("ord(%s) = %d\n", var, ord(var))
#    @}
# @}
@c endfile
@end group
@end example

An obvious improvement to these functions would be to move the code for the
@code{@w{_ord_init}} function into the body of the @code{BEGIN} rule.  It was
written this way initially for ease of development.

There is a ``test program'' in a @code{BEGIN} rule, for testing the
function.  It is commented out for production use.

@node Join Function, Mktime Function, Ordinal Functions, Library Functions
@section Merging an Array Into a String

@cindex merging strings
When doing string processing, it is often useful to be able to join
all the strings in an array into one long string.  The following function,
@code{join}, accomplishes this task.  It is used later in several of
the application programs
(@pxref{Sample Programs, ,Practical @code{awk} Programs}).

Good function design is important; this function needs to be general, but it
should also have a reasonable default behavior.
It is called with an array
and the beginning and ending indices of the elements in the array to be
merged.  This assumes that the array indices are numeric---a reasonable
assumption since the array was likely created with @code{split}
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).

@findex join
@example
@group
@c file eg/lib/join.awk
# join.awk --- join an array into a string
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993

function join(array, start, end, sep,    result, i)
@{
    if (sep == "")
        sep = " "
    else if (sep == SUBSEP) # magic value
        sep = ""
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result sep array[i]
    return result
@}
@c endfile
@end group
@end example

An optional additional argument is the separator to use when joining the
strings back together.  If the caller supplies a non-empty value,
@code{join} uses it.  If it is not supplied, it will have a null
value.  In this case, @code{join} uses a single blank as a default
separator for the strings.  If the value is equal to @code{SUBSEP},
then @code{join} joins the strings with no separator between them.
@code{SUBSEP} serves as a ``magic'' value to indicate that there should
be no separation between the component strings.

It would be nice if @code{awk} had an assignment operator for concatenation.
The lack of an explicit operator for concatenation makes string operations
more difficult than they really need to be.
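
The three separator cases described above can be seen in a short sketch.
It assumes that @file{join.awk} has been loaded with its own @samp{-f}
option; the array name @code{parts} and the sample string are made up
for illustration and are not part of the library file.

@example
# usage sketch (illustrative; not part of join.awk)
BEGIN @{
    n = split("a b c d", parts, " ")
    print join(parts, 1, n)           # separator omitted: one blank
    print join(parts, 2, 3, "-")      # explicit separator
    print join(parts, 1, n, SUBSEP)   # SUBSEP: no separator at all
@}
@end example

@noindent
With the @code{join} function shown above, this should print
@samp{a b c d}, @samp{b-c}, and @samp{abcd}, one per line.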

@node Mktime Function, Gettimeofday Function, Join Function, Library Functions
@section Turning Dates Into Timestamps

The @code{systime} function built in to @code{gawk}
returns the current time of day as
a timestamp in ``seconds since the Epoch.''  This timestamp
can be converted into a printable date of almost infinitely variable
format using the built-in @code{strftime} function.
(For more information on @code{systime} and @code{strftime},
@pxref{Time Functions, ,Functions for Dealing with Time Stamps}.)

@cindex converting dates to timestamps
@cindex dates, converting to timestamps
@cindex timestamps, converting from dates
An interesting but difficult problem is to convert a readable representation
of a date back into a timestamp.  The ANSI C library provides a @code{mktime}
function that does the basic job, converting a canonical representation of a
date into a timestamp.

It would appear at first glance that @code{gawk} would have to supply a
@code{mktime} built-in function that was simply a ``hook'' to the C language
version.  In fact though, @code{mktime} can be implemented entirely in
@code{awk}.@footnote{@value{UPDATE-MONTH}: Actually, I was mistaken when
I wrote this.  The version presented here doesn't always work correctly,
and the next major version of @code{gawk} will provide @code{mktime}
as a built-in function.}
@c sigh.

Here is a version of @code{mktime} for @code{awk}.  It takes a simple
representation of the date and time, and converts it into a timestamp.

The code is presented here intermixed with explanatory prose.  In
@ref{Extract Program, ,Extracting Programs from Texinfo Source Files},
you will see how the Texinfo source file for this @value{DOCUMENT}
can be processed to extract the code into a single source file.

The program begins with a descriptive comment and a @code{BEGIN} rule
that initializes a table @code{_tm_months}.  This table is a two-dimensional
array that has the lengths of the months.  The first index is zero for
regular years, and one for leap years.  The values are the same for all the
months in both kinds of years, except for February; thus the use of multiple
assignment.

@example
@c @group
@c file eg/lib/mktime.awk
# mktime.awk --- convert a canonical date representation
#                into a timestamp
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993

BEGIN    \
@{
    # Initialize table of month lengths
    _tm_months[0,1] = _tm_months[1,1] = 31
    _tm_months[0,2] = 28; _tm_months[1,2] = 29
    _tm_months[0,3] = _tm_months[1,3] = 31
    _tm_months[0,4] = _tm_months[1,4] = 30
    _tm_months[0,5] = _tm_months[1,5] = 31
    _tm_months[0,6] = _tm_months[1,6] = 30
    _tm_months[0,7] = _tm_months[1,7] = 31
    _tm_months[0,8] = _tm_months[1,8] = 31
    _tm_months[0,9] = _tm_months[1,9] = 30
    _tm_months[0,10] = _tm_months[1,10] = 31
    _tm_months[0,11] = _tm_months[1,11] = 30
    _tm_months[0,12] = _tm_months[1,12] = 31
@}
@c endfile
@c @end group
@end example

The benefit of merging multiple @code{BEGIN} rules
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns})
is particularly clear when writing library files.  Functions in library
files can cleanly initialize their own private data and also provide clean-up
actions in private @code{END} rules.

The next function is a simple one that computes whether a given year is or
is not a leap year.  If a year is evenly divisible by four, but not evenly
divisible by 100, or if it is evenly divisible by 400, then it is a leap
year.  Thus, 1904 was a leap year, 1900 was not, but 2000 will be.
@c Change this after the year 2000 to ``2000 was'' (:-)

@findex _tm_isleap
@example
@group
@c file eg/lib/mktime.awk
# decide if a year is a leap year
function _tm_isleap(year,    ret)
@{
    ret = (year % 4 == 0 && year % 100 != 0) ||
            (year % 400 == 0)

    return ret
@}
@c endfile
@end group
@end example

This function is only used a few times in this file, and its computation
could have been written @dfn{in-line} (at the point where it's used).
Making it a separate function made the original development easier, and also
avoids the possibility of typing errors when duplicating the code in
multiple places.

The next function is more interesting.  It does most of the work of
generating a timestamp, which is converting a date and time into some number
of seconds since the Epoch.  The caller passes an array (rather
imaginatively named @code{a}) containing six
values: the year including century, the month as a number between one and 12,
the day of the month, the hour as a number between zero and 23, the minute in
the hour, and the seconds within the minute.

The function uses several local variables to precompute the number of
seconds in an hour, seconds in a day, and seconds in a year. Often,
similar C code simply writes out the expression in-line, expecting the
compiler to do @dfn{constant folding}.  E.g., most C compilers would
turn @samp{60 * 60} into @samp{3600} at compile time, instead of recomputing
it every time at run time.  Precomputing these values makes the
function more efficient.

@findex _tm_addup
@example
@c @group
@c file eg/lib/mktime.awk
# convert a date into seconds
function _tm_addup(a,    total, yearsecs, daysecs,
                         hoursecs, i, j)
@{
    hoursecs = 60 * 60
    daysecs = 24 * hoursecs
    yearsecs = 365 * daysecs

    total = (a[1] - 1970) * yearsecs

@group
    # extra day for leap years
    for (i = 1970; i < a[1]; i++)
        if (_tm_isleap(i))
            total += daysecs
@end group

@group
    j = _tm_isleap(a[1])
    for (i = 1; i < a[2]; i++)
        total += _tm_months[j, i] * daysecs
@end group

    total += (a[3] - 1) * daysecs
    total += a[4] * hoursecs
    total += a[5] * 60
    total += a[6]

    return total
@}
@c endfile
@c @end group
@end example

The function starts with a first approximation of all the seconds between
Midnight, January 1, 1970,@footnote{This is the Epoch on POSIX systems.
It may be different on other systems.} and the beginning of the current
year.  It then goes through all those years, and for every leap year,
adds an additional day's worth of seconds.

The variable @code{j} holds one if the current year is a leap year,
and zero if it is not.
For every month in the current year prior to the current month, it adds
the number of seconds in the month, using the appropriate entry in the
@code{_tm_months} array.

Finally, it adds in the seconds for the number of days prior to the current
day, and the number of hours, minutes, and seconds in the current day.

The result is a count of seconds since January 1, 1970.  This value is not
yet what is needed though.  The reason why is described shortly.

The main @code{mktime} function takes a single character string argument.
This string is a representation of a date and time in a ``canonical''
(fixed) form.
This string should be
@code{"@var{year} @var{month} @var{day} @var{hour} @var{minute} @var{second}"}.

@findex mktime
@example
@c @group
@c file eg/lib/mktime.awk
# mktime --- convert a date into seconds,
#            compensate for time zone

function mktime(str,    res1, res2, a, b, i, j, t, diff)
@{
    i = split(str, a, " ")    # don't rely on FS

    if (i != 6)
        return -1

    # force numeric
    for (j in a)
        a[j] += 0

@group
    # validate
    if (a[1] < 1970 ||
        a[2] < 1 || a[2] > 12 ||
        a[3] < 1 || a[3] > 31 ||
        a[4] < 0 || a[4] > 23 ||
        a[5] < 0 || a[5] > 59 ||
        a[6] < 0 || a[6] > 60 )
            return -1
@end group

    res1 = _tm_addup(a)
    t = strftime("%Y %m %d %H %M %S", res1)

    if (_tm_debug)
        printf("(%s) -> (%s)\n", str, t) > "/dev/stderr"

    split(t, b, " ")
    res2 = _tm_addup(b)

    diff = res1 - res2

    if (_tm_debug)
        printf("diff = %d seconds\n", diff) > "/dev/stderr"

    res1 += diff

    return res1
@}
@c endfile
@c @end group
@end example

The function first splits the string into an array, using spaces and tabs as
separators.  If there are not six elements in the array, it returns an
error, signaled as the value @minus{}1.
Next, it forces each element of the array to be numeric, by adding zero to it.
The following @samp{if} statement then makes sure that each element is
within an allowable range.  (This checking could be extended further, e.g.,
to make sure that the day of the month is within the correct range for the
particular month supplied.)  All of this is essentially preliminary set-up
and error checking.

Recall that @code{_tm_addup} generated a value in seconds since Midnight,
January 1, 1970.
This value is not directly usable as the result we want,
@emph{since the calculation does not account for the local timezone}.  In other
words, the value represents the count in seconds since the Epoch, but only
for UTC (Coordinated Universal Time).  If the local timezone is east or west
of UTC, then some number of hours should be either added to, or subtracted from
the resulting timestamp.

For example, consider 6:23 p.m. in Atlanta, Georgia (USA), which is normally
five hours west of (behind) UTC.  It is only four hours behind UTC if daylight
saving time is in effect.
If you are calling @code{mktime} in Atlanta, with the argument
@code{@w{"1993 5 23 18 23 12"}}, the result from @code{_tm_addup} will be
for 6:23 p.m. UTC, which is only 2:23 p.m. in Atlanta.  It is necessary to
add another four hours' worth of seconds to the result.

How can @code{mktime} determine how far away it is from UTC?  This is
surprisingly easy.  The returned timestamp represents the time passed to
@code{mktime} @emph{as UTC}.  This timestamp can be fed back to
@code{strftime}, which will format it as a @emph{local} time; i.e.@: as
if it already had the UTC difference added in to it.  This is done by
giving @code{@w{"%Y %m %d %H %M %S"}} to @code{strftime} as the format
argument.  It returns the computed timestamp in the original string
format.  The result represents a time that accounts for the UTC
difference.  When the new time is converted back to a timestamp, the
difference between the two timestamps is the difference (in seconds)
between the local timezone and UTC.  This difference is then added back
to the original result.  An example demonstrating this is presented below.

Finally, there is a ``main'' program for testing the function.

@example
@c there used to be a blank line after the getline,
@c squished out for page formatting reasons
@c @group
@c file eg/lib/mktime.awk
BEGIN  @{
    if (_tm_test) @{
        printf "Enter date as yyyy mm dd hh mm ss: "
        getline _tm_test_date
        t = mktime(_tm_test_date)
        r = strftime("%Y %m %d %H %M %S", t)
        printf "Got back (%s)\n", r
    @}
@}
@c endfile
@c @end group
@end example

The entire program uses two variables that can be set on the command
line to control debugging output and to enable the test in the final
@code{BEGIN} rule.  Here is the result of a test run. (Note that debugging
output is to standard error, and test output is to standard output.)

@example
@c @group
$ gawk -f mktime.awk -v _tm_test=1 -v _tm_debug=1
@print{} Enter date as yyyy mm dd hh mm ss: 1993 5 23 15 35 10
@error{} (1993 5 23 15 35 10) -> (1993 05 23 11 35 10)
@error{} diff = 14400 seconds
@print{} Got back (1993 05 23 15 35 10)
@c @end group
@end example

The time entered was 3:35 p.m. (15:35 on a 24-hour clock), on May 23, 1993.
The first line
of debugging output shows the resulting time as UTC---four hours ahead of
the local time zone.  The second line shows that the difference is 14400
seconds, which is four hours.  (The difference is only four hours, since
daylight saving time is in effect during May.)
The final line of test output shows that the timezone compensation
algorithm works; the returned time is the same as the entered time.

This program does not solve the general problem of turning an arbitrary date
representation into a timestamp.  That problem is very involved.  However,
the @code{mktime} function provides a foundation upon which to build.
Other
software can convert month names into numeric months, and AM/PM times into
24-hour clocks, to generate the ``canonical'' format that @code{mktime}
requires.

@node Gettimeofday Function, Filetrans Function, Mktime Function, Library Functions
@section Managing the Time of Day

@cindex formatted timestamps
@cindex timestamps, formatted
The @code{systime} and @code{strftime} functions described in
@ref{Time Functions, ,Functions for Dealing with Time Stamps},
provide the minimum functionality necessary for dealing with the time of day
in human readable form.  While @code{strftime} is extensive, the control
formats are not necessarily easy to remember or intuitively obvious when
reading a program.

The following function, @code{gettimeofday}, populates a user-supplied array
with pre-formatted time information.  It returns a string with the current
time formatted in the same way as the @code{date} utility.

@findex gettimeofday
@example
@c @group
@c file eg/lib/gettime.awk
# gettimeofday --- get the time of day in a usable format
# Arnold Robbins, arnold@@gnu.org, Public Domain, May 1993
#
# Returns a string in the format of output of date(1)
# Populates the array argument time with individual values:
#    time["second"]       -- seconds (0 - 59)
#    time["minute"]       -- minutes (0 - 59)
#    time["hour"]         -- hours (0 - 23)
#    time["althour"]      -- hours (1 - 12)
#    time["monthday"]     -- day of month (1 - 31)
#    time["month"]        -- month of year (1 - 12)
#    time["monthname"]    -- name of the month
#    time["shortmonth"]   -- short name of the month
#    time["year"]         -- year within century (0 - 99)
#    time["fullyear"]     -- year with century (19xx or 20xx)
#    time["weekday"]      -- day of week (Sunday = 0)
#    time["altweekday"]   -- day of week (Monday = 1)
#    time["weeknum"]      -- week number, Sunday first day
#    time["altweeknum"]   -- week number, Monday first day
#    time["dayname"]      -- name of weekday
#    time["shortdayname"] -- short name of weekday
#    time["yearday"]      -- day of year (1 - 366)
#    time["timezone"]     -- abbreviation of timezone name
#    time["ampm"]         -- AM or PM designation

function gettimeofday(time,    ret, now, i)
@{
    # get time once, avoids unnecessary system calls
    now = systime()

    # return date(1)-style output
    ret = strftime("%a %b %d %H:%M:%S %Z %Y", now)

    # clear out target array
    for (i in time)
        delete time[i]

    # fill in values, force numeric values to be
    # numeric by adding 0
    time["second"]       = strftime("%S", now) + 0
    time["minute"]       = strftime("%M", now) + 0
    time["hour"]         = strftime("%H", now) + 0
    time["althour"]      = strftime("%I", now) + 0
    time["monthday"]     = strftime("%d", now) + 0
    time["month"]        = strftime("%m", now) + 0
    time["monthname"]    = strftime("%B", now)
    time["shortmonth"]   = strftime("%b", now)
    time["year"]         = strftime("%y", now) + 0
    time["fullyear"]     = strftime("%Y", now) + 0
    time["weekday"]      = strftime("%w", now) + 0
    time["altweekday"]   = strftime("%u", now) + 0
    time["dayname"]      = strftime("%A", now)
    time["shortdayname"] = strftime("%a", now)
    time["yearday"]      = strftime("%j", now) + 0
    time["timezone"]     = strftime("%Z", now)
    time["ampm"]         = strftime("%p", now)
    time["weeknum"]      = strftime("%U", now) + 0
    time["altweeknum"]   = strftime("%W", now) + 0

    return ret
@}
@c endfile
@end example

The string indices are easier to use and read than the various formats
required by @code{strftime}.  The @code{alarm} program presented in
@ref{Alarm Program, ,An Alarm Clock Program},
uses this function.

@c exercise!!!
The @code{gettimeofday} function is presented above as it was written.  A
more general design for this function would have allowed the user to supply
an optional timestamp value that would have been used instead of the current
time.

@node Filetrans Function, Getopt Function, Gettimeofday Function, Library Functions
@section Noting Data File Boundaries

@cindex per file initialization and clean-up
The @code{BEGIN} and @code{END} rules are each executed exactly once, at
the beginning and end respectively of your @code{awk} program
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).
We (the @code{gawk} authors) once had a user who mistakenly thought that the
@code{BEGIN} rule was executed at the beginning of each data file and the
@code{END} rule was executed at the end of each data file.
When informed
that this was not the case, the user requested that we add new special
patterns to @code{gawk}, named @code{BEGIN_FILE} and @code{END_FILE}, that
would have the desired behavior.  He even supplied us the code to do so.

However, after a little thought, I came up with the following library program.
It arranges to call two user-supplied functions, @code{beginfile} and
@code{endfile}, at the beginning and end of each data file.
Besides solving the problem in only nine(!) lines of code, it does so
@emph{portably}; this will work with any implementation of @code{awk}.

@example
@c @group
# transfile.awk
#
# Give the user a hook for filename transitions
#
# The user must supply functions beginfile() and endfile()
# that each take the name of the file being started or
# finished, respectively.
#
# Arnold Robbins, arnold@@gnu.org, January 1992
# Public Domain

FILENAME != _oldfilename \
@{
    if (_oldfilename != "")
        endfile(_oldfilename)
    _oldfilename = FILENAME
    beginfile(FILENAME)
@}

END   @{ endfile(FILENAME) @}
@c @end group
@end example

This file must be loaded before the user's ``main'' program, so that the
rule it supplies will be executed first.

This rule relies on @code{awk}'s @code{FILENAME} variable that
automatically changes for each new data file.  The current file name is
saved in a private variable, @code{_oldfilename}.  If @code{FILENAME} does
not equal @code{_oldfilename}, then a new data file is being processed, and
it is necessary to call @code{endfile} for the old file.  Since
@code{endfile} should only be called if a file has been processed, the
program first checks to make sure that @code{_oldfilename} is not the null
string.
The program then assigns the current file name to
@code{_oldfilename}, and calls @code{beginfile} for the file.
Since, like all @code{awk} variables, @code{_oldfilename} will be
initialized to the null string, this rule executes correctly even for the
first data file.

The program also supplies an @code{END} rule, to do the final processing for
the last file.  Since this @code{END} rule comes before any @code{END} rules
supplied in the ``main'' program, @code{endfile} will be called first.  Once
again the value of multiple @code{BEGIN} and @code{END} rules should be clear.

@findex beginfile
@findex endfile
This version has the same problem as the first version of @code{nextfile}
(@pxref{Nextfile Function, ,Implementing @code{nextfile} as a Function}).
If the same data file occurs twice in a row on the command line, then
@code{endfile} and @code{beginfile} will not be executed at the end of the
first pass and at the beginning of the second pass.
The following version solves the problem.

@example
@c @group
@c file eg/lib/ftrans.awk
# ftrans.awk --- handle data file transitions
#
# user supplies beginfile() and endfile() functions
#
# Arnold Robbins, arnold@@gnu.org, November 1992
# Public Domain

FNR == 1 @{
    if (_filename_ != "")
        endfile(_filename_)
    _filename_ = FILENAME
    beginfile(FILENAME)
@}

END  @{ endfile(_filename_) @}
@c endfile
@c @end group
@end example

In @ref{Wc Program, ,Counting Things},
you will see how this library function can be used, and
how it simplifies writing the main program.
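
As a small sketch of how the two hooks might be filled in, the following
user-supplied functions count the records in each data file.  The variable
name @code{_count} and the output format are illustrative choices; they are
not part of @file{ftrans.awk}, and this file is assumed to be loaded after it.

@example
# usage sketch (illustrative): load after ftrans.awk
function beginfile(name)
@{
    _count = 0      # start a fresh count for this file
@}

function endfile(name)
@{
    printf("%s: %d records\n", name, _count)
@}

@{ _count++ @}       # runs for every record, after the FNR == 1 rule
@end example

@noindent
Because the rule in @file{ftrans.awk} comes first, @code{beginfile} resets
@code{_count} before the first record of each file is counted.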

@node Getopt Function, Passwd Functions, Filetrans Function, Library Functions
@section Processing Command Line Options

@cindex @code{getopt}, C version
@cindex processing arguments
@cindex argument processing
Most utilities on POSIX compatible systems take options or ``switches'' on
the command line that can be used to change the way a program behaves.
@code{awk} is an example of such a program
(@pxref{Options, ,Command Line Options}).
Often, options take @dfn{arguments}, data that the program needs to
correctly obey the command line option.  For example, @code{awk}'s
@samp{-F} option requires a string to use as the field separator.
The first occurrence on the command line of either @samp{--} or a
string that does not begin with @samp{-} ends the options.

Most Unix systems provide a C function named @code{getopt} for processing
command line arguments.  The programmer provides a string describing the one
letter options. If an option requires an argument, it is followed in the
string with a colon.  @code{getopt} is also passed the
count and values of the command line arguments, and is called in a loop.
@code{getopt} processes the command line arguments for option letters.
Each time around the loop, it returns a single character representing the
next option letter that it found, or @samp{?} if it found an invalid option.
When it returns @minus{}1, there are no options left on the command line.

When using @code{getopt}, options that do not take arguments can be
grouped together.  Furthermore, options that take arguments require that the
argument be present.  The argument can immediately follow the option letter,
or it can be a separate command line argument.

Given a hypothetical program that takes
three command line options, @samp{-a}, @samp{-b}, and @samp{-c}, and
@samp{-b} requires an argument, all of the following are valid ways of
invoking the program:

@example
@c @group
prog -a -b foo -c data1 data2 data3
prog -ac -bfoo -- data1 data2 data3
prog -acbfoo data1 data2 data3
@c @end group
@end example

Notice that when the argument is grouped with its option, the rest of
the command line argument is considered to be the option's argument.
In the above example, @samp{-acbfoo} indicates that all of the
@samp{-a}, @samp{-b}, and @samp{-c} options were supplied,
and that @samp{foo} is the argument to the @samp{-b} option.

@code{getopt} provides four external variables that the programmer can use.

@table @code
@item optind
The index in the argument value array (@code{argv}) where the first
non-option command line argument can be found.

@item optarg
The string value of the argument to an option.

@item opterr
Usually @code{getopt} prints an error message when it finds an invalid
option.  Setting @code{opterr} to zero disables this feature.  (An
application might wish to print its own error message.)

@item optopt
The letter representing the command line option.
While not usually documented, most versions supply this variable.
@end table

The following C fragment shows how @code{getopt} might process command line
arguments for @code{awk}.

@example
@group
int
main(int argc, char *argv[])
@{
    @dots{}
    /* print our own message */
    opterr = 0;
@end group
@group
    while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) @{
        switch (c) @{
        case 'f':    /* file */
            @dots{}
            break;
        case 'F':    /* field separator */
            @dots{}
            break;
        case 'v':    /* variable assignment */
            @dots{}
            break;
        case 'W':    /* extension */
            @dots{}
            break;
        case '?':
        default:
            usage();
            break;
        @}
    @}
    @dots{}
@}
@end group
@end example

As a side point, @code{gawk} actually uses the GNU @code{getopt_long}
function to process both normal and GNU-style long options
(@pxref{Options, ,Command Line Options}).

The abstraction provided by @code{getopt} is very useful, and would be quite
handy in @code{awk} programs as well. Here is an @code{awk} version of
@code{getopt}. This function highlights one of the greatest weaknesses in
@code{awk}, which is that it is very poor at manipulating single characters.
Repeated calls to @code{substr} are necessary for accessing individual
characters (@pxref{String Functions, ,Built-in Functions for String Manipulation}).

The discussion walks through the code a bit at a time.

@example
@c @group
@c file eg/lib/getopt.awk
# getopt --- do C library getopt(3) function in awk
#
# arnold@@gnu.org
# Public domain
#
# Initial version: March, 1991
# Revised: May, 1993

@group
# External variables:
#    Optind -- index of ARGV for first non-option argument
#    Optarg -- string value of argument to current option
#    Opterr -- if non-zero, print our own diagnostic
#    Optopt -- current option letter
@end group

# Returns
#    -1     at end of options
#    ?      for unrecognized option
#    <c>    a character representing the current option

# Private Data
#    _opti  index in multi-flag option, e.g., -abc
@c endfile
@c @end group
@end example

The function starts out with some documentation: who wrote the code,
and when it was revised, followed by a list of the global variables it uses,
what the return values are and what they mean, and any global variables that
are ``private'' to this library function. Such documentation is essential
for any program, and particularly for library functions.

@findex getopt
@example
@c @group
@c file eg/lib/getopt.awk
function getopt(argc, argv, options,    optl, thisopt, i)
@{
    optl = length(options)
    if (optl == 0)        # no options given
        return -1

    if (argv[Optind] == "--") @{  # all done
        Optind++
        _opti = 0
        return -1
    @} else if (argv[Optind] !~ /^-[^: \t\n\f\r\v\b]/) @{
        _opti = 0
        return -1
    @}
@c endfile
@c @end group
@end example

The function first checks that it was indeed called with a string of options
(the @code{options} parameter). If @code{options} has a zero length,
@code{getopt} immediately returns @minus{}1.

The next thing to check for is the end of the options. A @samp{--} ends the
command line options, as does any command line argument that does not begin
with a @samp{-}. @code{Optind} is used to step through the array of command
line arguments; it retains its value across calls to @code{getopt}, since it
is a global variable.

The regexp used, @code{@w{/^-[^: \t\n\f\r\v\b]/}}, is
perhaps a bit of overkill; it checks for a @samp{-} followed by anything
that is not whitespace and not a colon.
If the current command line argument does not match this pattern,
it is not an option, and it ends option processing.
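
To see which strings the end-of-options test accepts, the two checks can
be wrapped in a small predicate and run against a few sample arguments.
This is a sketch only: the function name @code{is_option} and the sample
strings are assumptions for illustration; the regexp is the one used in
@code{getopt} above.

```awk
# is_option --- hypothetical helper, for illustration only.
# Returns 1 if arg would be treated as an option argument by the
# checks at the top of getopt, 0 if it would end option processing.
function is_option(arg)
{
    if (arg == "--")          # explicit end-of-options marker
        return 0
    return (arg ~ /^-[^: \t\n\f\r\v\b]/) ? 1 : 0
}

BEGIN {
    # sample arguments, separated by commas so one may contain a space
    n = split("-a,-cbARG,--,-,bax,-:,- x", args, ",")
    for (i = 1; i <= n; i++)
        printf("<%s> %s\n", args[i],
               is_option(args[i]) ? "option" : "ends the options")
}
```

Note that @samp{--} has to be tested first, since it would otherwise
match the regexp.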

@example
@group
@c file eg/lib/getopt.awk
    if (_opti == 0)
        _opti = 2
    thisopt = substr(argv[Optind], _opti, 1)
    Optopt = thisopt
    i = index(options, thisopt)
    if (i == 0) @{
        if (Opterr)
            printf("%c -- invalid option\n",
                                  thisopt) > "/dev/stderr"
        if (_opti >= length(argv[Optind])) @{
            Optind++
            _opti = 0
        @} else
            _opti++
        return "?"
    @}
@c endfile
@end group
@end example

The @code{_opti} variable tracks the position in the current command line
argument (@code{argv[Optind]}). In the case that multiple options were
grouped together with one @samp{-} (e.g., @samp{-abx}), it is necessary
to return them to the user one at a time.

If @code{_opti} is equal to zero, it is set to two, the index in the string
of the next character to look at (we skip the @samp{-}, which is at position
one). The variable @code{thisopt} holds the character, obtained with
@code{substr}. It is saved in @code{Optopt} for the main program to use.

If @code{thisopt} is not in the @code{options} string, then it is an
invalid option. If @code{Opterr} is non-zero, @code{getopt} prints an error
message on the standard error that is similar to the message from the C
version of @code{getopt}.

Since the option is invalid, it is necessary to skip it and move on to the
next option character. If @code{_opti} is greater than or equal to the
length of the current command line argument, then it is necessary to move on
to the next one, so @code{Optind} is incremented and @code{_opti} is reset
to zero. Otherwise, @code{Optind} is left alone and @code{_opti} is merely
incremented.

In any case, since the option was invalid, @code{getopt} returns @samp{?}.
The main program can examine @code{Optopt} if it needs to know what the
invalid option letter actually was.
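
The character-at-a-time stepping that @code{_opti} performs can be seen
in isolation in the following sketch. The helper name
@code{split_flags} is an assumption for illustration; @code{getopt}
itself does not build an array, it returns one letter per call.

```awk
# split_flags --- hypothetical helper, for illustration only.
# Store each option letter of a grouped argument such as "-abx"
# in letters[1..n] and return n.
function split_flags(arg, letters,    i, n)
{
    n = 0
    for (i = 2; i <= length(arg); i++)   # position 1 holds the "-"
        letters[++n] = substr(arg, i, 1)
    return n
}

BEGIN {
    n = split_flags("-abx", flags)
    for (i = 1; i <= n; i++)
        printf("flag %d = <%s>\n", i, flags[i])
}
```

Starting at position two and calling @code{substr} once per character is
exactly what makes single-character handling in @code{awk} verbose.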

@example
@group
@c file eg/lib/getopt.awk
    if (substr(options, i + 1, 1) == ":") @{
        # get option argument
        if (length(substr(argv[Optind], _opti + 1)) > 0)
            Optarg = substr(argv[Optind], _opti + 1)
        else
            Optarg = argv[++Optind]
        _opti = 0
    @} else
        Optarg = ""
@c endfile
@end group
@end example

If the option requires an argument, the option letter is followed by a colon
in the @code{options} string. If there are remaining characters in the
current command line argument (@code{argv[Optind]}), then the rest of that
string is assigned to @code{Optarg}. Otherwise, the next command line
argument is used (@samp{-xFOO} vs. @samp{@w{-x FOO}}). In either case,
@code{_opti} is reset to zero, since there are no more characters left to
examine in the current command line argument.

@example
@c @group
@c file eg/lib/getopt.awk
    if (_opti == 0 || _opti >= length(argv[Optind])) @{
        Optind++
        _opti = 0
    @} else
        _opti++
    return thisopt
@}
@c endfile
@c @end group
@end example

Finally, if @code{_opti} is either zero or greater than or equal to the
length of the current command line argument, it means this element in
@code{argv} is through being processed, so @code{Optind} is incremented to
point to the next element in @code{argv}. If neither condition is true, then
only @code{_opti} is incremented, so that the next option letter can be
processed on the next call to @code{getopt}.
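
The two ways of supplying an option's argument (@samp{-xFOO} and
@samp{@w{-x FOO}}) can be compared side by side with a small sketch.
The helper name @code{option_arg} and the arrays @code{a} and @code{b}
are assumptions for illustration; the logic mirrors the @code{Optarg}
assignment in @code{getopt}.

```awk
# option_arg --- hypothetical helper, for illustration only.
# argv[ind] holds an option letter at position pos; return the
# argument that belongs to that option.
function option_arg(argv, ind, pos,    rest)
{
    rest = substr(argv[ind], pos + 1)
    if (length(rest) > 0)
        return rest            # attached form: -xFOO
    return argv[ind + 1]       # separate form: -x FOO
}

BEGIN {
    a[1] = "-xFOO"
    b[1] = "-x"; b[2] = "FOO"
    printf("attached: <%s>\n", option_arg(a, 1, 2))
    printf("separate: <%s>\n", option_arg(b, 1, 2))
}
```

Both calls yield the same argument string, which is why a main program
using @code{getopt} does not need to care which form the user typed.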

@example
@c @group
@c file eg/lib/getopt.awk
BEGIN @{
    Opterr = 1    # default is to diagnose
    Optind = 1    # skip ARGV[0]

    # test program
    if (_getopt_test) @{
        while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1)
            printf("c = <%c>, optarg = <%s>\n",
                                       _go_c, Optarg)
        printf("non-option arguments:\n")
        for (; Optind < ARGC; Optind++)
            printf("\tARGV[%d] = <%s>\n",
                                    Optind, ARGV[Optind])
    @}
@}
@c endfile
@c @end group
@end example

The @code{BEGIN} rule initializes both @code{Opterr} and @code{Optind} to one.
@code{Opterr} is set to one, since the default behavior is for @code{getopt}
to print a diagnostic message upon seeing an invalid option. @code{Optind}
is set to one, since there's no reason to look at the program name, which is
in @code{ARGV[0]}.

The rest of the @code{BEGIN} rule is a simple test program. Here is the
result of two sample runs of the test program.

@example
@group
$ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x
@print{} c = <a>, optarg = <>
@print{} c = <c>, optarg = <>
@print{} c = <b>, optarg = <ARG>
@print{} non-option arguments:
@print{}         ARGV[3] = <bax>
@print{}         ARGV[4] = <-x>
@end group

@group
$ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc
@print{} c = <a>, optarg = <>
@error{} x -- invalid option
@print{} c = <?>, optarg = <>
@print{} non-option arguments:
@print{}         ARGV[4] = <xyz>
@print{}         ARGV[5] = <abc>
@end group
@end example

The first @samp{--} terminates the arguments to @code{awk}, so that it does
not try to interpret the @samp{-a} etc. as its own options.

Several of the sample programs presented in
@ref{Sample Programs, ,Practical @code{awk} Programs},
use @code{getopt} to process their arguments.
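
As a rough sketch of how a main program might call @code{getopt}, the
following processes a @samp{-v} flag and a @samp{-o} option that takes
an argument. The option letters, the variables @code{verbose} and
@code{outfile}, and the hand-built array @code{av} (standing in for
@code{ARGV} so that the sketch needs no command line) are all
assumptions for illustration. The body of @code{getopt} is repeated
from this section, without its comments, only so that the sketch runs on
its own; a real program would load @file{getopt.awk} with @samp{-f}
instead.

```awk
function getopt(argc, argv, options,    optl, thisopt, i)
{
    optl = length(options)
    if (optl == 0)
        return -1
    if (argv[Optind] == "--") {
        Optind++
        _opti = 0
        return -1
    } else if (argv[Optind] !~ /^-[^: \t\n\f\r\v\b]/) {
        _opti = 0
        return -1
    }
    if (_opti == 0)
        _opti = 2
    thisopt = substr(argv[Optind], _opti, 1)
    Optopt = thisopt
    i = index(options, thisopt)
    if (i == 0) {
        if (Opterr)
            printf("%c -- invalid option\n", thisopt) > "/dev/stderr"
        if (_opti >= length(argv[Optind])) {
            Optind++
            _opti = 0
        } else
            _opti++
        return "?"
    }
    if (substr(options, i + 1, 1) == ":") {
        if (length(substr(argv[Optind], _opti + 1)) > 0)
            Optarg = substr(argv[Optind], _opti + 1)
        else
            Optarg = argv[++Optind]
        _opti = 0
    } else
        Optarg = ""
    if (_opti == 0 || _opti >= length(argv[Optind])) {
        Optind++
        _opti = 0
    } else
        _opti++
    return thisopt
}

BEGIN {
    Opterr = 1
    Optind = 1

    # build a stand-in for ARGV: av[0] is the program name
    argc = split("-v -o out.txt in1 in2", tmp, " ")
    av[0] = "prog"
    for (i = 1; i <= argc; i++)
        av[i] = tmp[i]
    argc++                      # count av[0] too, like ARGC

    while ((c = getopt(argc, av, "vo:")) != -1) {
        if (c == "v")
            verbose = 1
        else if (c == "o")
            outfile = Optarg
        else
            print "usage: prog [-v] [-o file] [files]"
    }
    printf("verbose = %d, outfile = <%s>\n", verbose, outfile)
    for (; Optind < argc; Optind++)
        printf("operand: <%s>\n", av[Optind])
}
```

After the loop, @code{Optind} indexes the first non-option argument, so
the remaining elements can be treated as ordinary operands.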

@node Passwd Functions, Group Functions, Getopt Function, Library Functions
@section Reading the User Database

@cindex @file{/dev/user}
The @file{/dev/user} special file
(@pxref{Special Files, ,Special File Names in @code{gawk}})
provides access to the current user's real and effective user and group id
numbers, and if available, the user's supplementary group set.
However, since these are numbers, they do not provide very useful
information to the average user. There needs to be some way to find the
user information associated with the user and group numbers. This
section presents a suite of functions for retrieving information from the
user database. @xref{Group Functions, ,Reading the Group Database},
for a similar suite that retrieves information from the group database.

@cindex @code{getpwent}, C version
@cindex user information
@cindex login information
@cindex account information
@cindex password file
The POSIX standard does not define the file where user information is
kept. Instead, it provides the @code{<pwd.h>} header file
and several C language subroutines for obtaining user information.
The primary function is @code{getpwent}, for ``get password entry.''
The ``password'' comes from the original user database file,
@file{/etc/passwd}, which kept user information, along with the
encrypted passwords (hence the name).

While an @code{awk} program could simply read @file{/etc/passwd} directly
(the format is well known), because of the way password
files are handled on networked systems,
this file may not contain complete information about the system's set of users.

@cindex @code{pwcat} program
To be sure of being
able to produce a readable, complete version of the user database, it is
necessary to write a small C program that calls @code{getpwent}.
@code{getpwent} is defined to return a pointer to a @code{struct passwd}.
Each time it is called, it returns the next entry in the database.
When there are no more entries, it returns @code{NULL}, the null pointer.
When this happens, the C program should call @code{endpwent} to close the
database.
Here is @code{pwcat}, a C program that ``cats'' the password database.

@findex pwcat.c
@example
@c @group
@c file eg/lib/pwcat.c
/*
 * pwcat.c
 *
 * Generate a printable version of the password database
 *
 * Arnold Robbins
 * arnold@@gnu.org
 * May 1993
 * Public Domain
 */

#include <stdio.h>
#include <stdlib.h>    /* for exit() */
#include <pwd.h>

int
main(int argc, char **argv)
@{
    struct passwd *p;

    /* the id types vary by system, so print them as long */
    while ((p = getpwent()) != NULL)
        printf("%s:%s:%ld:%ld:%s:%s:%s\n",
            p->pw_name, p->pw_passwd, (long) p->pw_uid,
            (long) p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);

    endpwent();
    exit(0);
@}
@c endfile
@c @end group
@end example

If you don't understand C, don't worry about it.
The output from @code{pwcat} is the user database, in the traditional
@file{/etc/passwd} format of colon-separated fields. The fields are:

@table @asis
@item Login name
The user's login name.

@item Encrypted password
The user's encrypted password. This may not be available on some systems.

@item User-ID
The user's numeric user-id number.

@item Group-ID
The user's numeric group-id number.

@item Full name
The user's full name, and perhaps other information associated with the
user.

@item Home directory
The user's login, or ``home'' directory (familiar to shell programmers as
@code{$HOME}).

@item Login shell
The program that will be run when the user logs in.
This is usually a
shell, such as Bash (the GNU Bourne-Again shell).
@end table

Here are a few lines representative of @code{pwcat}'s output.

@example
@c @group
$ pwcat
@print{} root:3Ov02d5VaUPB6:0:1:Operator:/:/bin/sh
@print{} nobody:*:65534:65534::/:
@print{} daemon:*:1:1::/:
@print{} sys:*:2:2::/:/bin/csh
@print{} bin:*:3:3::/bin:
@print{} arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
@print{} miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh
@print{} andy:abcca2:113:10:Andy Jacobs:/home/andy:/bin/sh
@dots{}
@c @end group
@end example

With that introduction, here is a group of functions for getting user
information. There are several functions here, corresponding to the C
functions of the same name.

@findex _pw_init
@example
@c file eg/lib/passwdawk.in
@group
# passwd.awk --- access password file information
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993

BEGIN @{
    # tailor this to suit your system
    _pw_awklib = "/usr/local/libexec/awk/"
@}
@end group

@group
function _pw_init(    oldfs, oldrs, olddol0, pwcat)
@{
    if (_pw_inited)
        return
    oldfs = FS
    oldrs = RS
    olddol0 = $0
    FS = ":"
    RS = "\n"
    pwcat = _pw_awklib "pwcat"
    while ((pwcat | getline) > 0) @{
        _pw_byname[$1] = $0
        _pw_byuid[$3] = $0
        _pw_bycount[++_pw_total] = $0
    @}
    close(pwcat)
    _pw_count = 0
    _pw_inited = 1
    FS = oldfs
    RS = oldrs
    $0 = olddol0
@}
@c endfile
@end group
@end example

The @code{BEGIN} rule sets a private variable to the directory where
@code{pwcat} is stored. Since it is used to help out an @code{awk} library
routine, we have chosen to put it in @file{/usr/local/libexec/awk}.
You might want it to be in a different directory on your system.

The function @code{_pw_init} keeps three copies of the user information
in three associative arrays. The arrays are indexed by user name
(@code{_pw_byname}), by user-id number (@code{_pw_byuid}), and by order of
occurrence (@code{_pw_bycount}).

The variable @code{_pw_inited} is used for efficiency; @code{_pw_init} only
needs to be called once.

Since this function uses @code{getline} to read information from
@code{pwcat}, it first saves the values of @code{FS}, @code{RS}, and
@code{$0}. Doing so is necessary, since these functions could be called
from anywhere within a user's program, and the user may have his or her
own values for @code{FS} and @code{RS}.
@ignore
Problem, what if FIELDWIDTHS is in use? Sigh.
@end ignore

The main part of the function uses a loop to read database lines, split
the line into fields, and then store the line into each array as necessary.
When the loop is done, @code{@w{_pw_init}} cleans up by closing the pipeline,
setting @code{@w{_pw_inited}} to one, and restoring @code{FS}, @code{RS}, and
@code{$0}. The use of @code{@w{_pw_count}} will be explained below.

@findex getpwnam
@example
@group
@c file eg/lib/passwdawk.in
function getpwnam(name)
@{
    _pw_init()
    if (name in _pw_byname)
        return _pw_byname[name]
    return ""
@}
@c endfile
@end group
@end example

The @code{getpwnam} function takes a user name as a string argument. If that
user is in the database, it returns the appropriate line. Otherwise it
returns the null string.
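
Since the lookup functions return a whole colon-separated line, a caller
usually has to split it to get at a single field. The following sketch
shows one way to do that. The helper name @code{pw_field} is an
assumption for illustration, and the sample line is taken from the
@code{pwcat} output shown earlier rather than from a live database.

```awk
# pw_field --- hypothetical helper, for illustration only.
# Return field number n of a passwd-style line, or "" if the
# line has fewer than n fields.
function pw_field(line, n,    parts)
{
    if (split(line, parts, ":") < n)
        return ""
    return parts[n]
}

BEGIN {
    # in a real program this line would come from getpwnam("arnold")
    entry = "arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh"
    printf("uid = %s, home = %s, shell = %s\n",
           pw_field(entry, 3), pw_field(entry, 6), pw_field(entry, 7))
}
```

Using @code{split} on the returned string leaves @code{$0} and
@code{FS} in the calling program untouched.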

@findex getpwuid
@example
@group
@c file eg/lib/passwdawk.in
function getpwuid(uid)
@{
    _pw_init()
    if (uid in _pw_byuid)
        return _pw_byuid[uid]
    return ""
@}
@c endfile
@end group
@end example

Similarly,
the @code{getpwuid} function takes a user-id number argument. If that
user number is in the database, it returns the appropriate line. Otherwise it
returns the null string.

@findex getpwent
@example
@c @group
@c file eg/lib/passwdawk.in
function getpwent()
@{
    _pw_init()
    if (_pw_count < _pw_total)
        return _pw_bycount[++_pw_count]
    return ""
@}
@c endfile
@c @end group
@end example

The @code{getpwent} function simply steps through the database, one entry at
a time. It uses @code{_pw_count} to track its current position in the
@code{_pw_bycount} array.

@findex endpwent
@example
@c @group
@c file eg/lib/passwdawk.in
function endpwent()
@{
    _pw_count = 0
@}
@c endfile
@c @end group
@end example

The @code{@w{endpwent}} function resets @code{@w{_pw_count}} to zero, so that
subsequent calls to @code{getpwent} will start over again.

A conscious design decision in this suite is that each subroutine calls
@code{@w{_pw_init}} to initialize the database arrays. The overhead of running
a separate process to generate the user database, and the I/O to scan it,
will only be incurred if the user's main program actually calls one of these
functions. If this library file is loaded along with a user's program, but
none of the routines are ever called, then there is no extra run-time overhead.
(The alternative would be to move the body of @code{@w{_pw_init}} into a
@code{BEGIN} rule, which would always run @code{pwcat}.
This simplifies the
code but runs an extra process that may never be needed.)

In turn, calling @code{_pw_init} is not too expensive, since the
@code{_pw_inited} variable keeps the program from reading the data more than
once. If you are worried about squeezing every last cycle out of your
@code{awk} program, the check of @code{_pw_inited} could be moved out of
@code{_pw_init} and duplicated in all the other functions. In practice,
this is not necessary, since most @code{awk} programs are I/O bound, and it
would clutter up the code.

The @code{id} program in @ref{Id Program, ,Printing Out User Information},
uses these functions.

@node Group Functions, Library Names, Passwd Functions, Library Functions
@section Reading the Group Database

@cindex @code{getgrent}, C version
@cindex group information
@cindex account information
@cindex group file
Much of the discussion presented in
@ref{Passwd Functions, ,Reading the User Database},
applies to the group database as well. Although there has traditionally
been a well known file, @file{/etc/group}, in a well known format, the POSIX
standard only provides a set of C library routines
(@code{<grp.h>} and @code{getgrent})
for accessing the information.
Even though this file may exist, it likely does not have
complete information. Therefore, as with the user database, it is necessary
to have a small C program that generates the group database as its output.

@cindex @code{grcat} program
Here is @code{grcat}, a C program that ``cats'' the group database.

@findex grcat.c
@example
@c @group
@c file eg/lib/grcat.c
/*
 * grcat.c
 *
 * Generate a printable version of the group database
 *
 * Arnold Robbins, arnold@@gnu.org
 * May 1993
 * Public Domain
 */

#include <stdio.h>
#include <stdlib.h>    /* for exit() */
#include <grp.h>

@group
int
main(int argc, char **argv)
@{
    struct group *g;
    int i;
@end group

@group
    while ((g = getgrent()) != NULL) @{
        /* the id type varies by system, so print it as long */
        printf("%s:%s:%ld:", g->gr_name, g->gr_passwd,
                                            (long) g->gr_gid);
@end group
        for (i = 0; g->gr_mem[i] != NULL; i++) @{
            printf("%s", g->gr_mem[i]);
            if (g->gr_mem[i+1] != NULL)
                putchar(',');
        @}
        putchar('\n');
    @}
    endgrent();
    exit(0);
@}
@c endfile
@c @end group
@end example

Each line in the group database represents one group. The fields are
separated with colons, and represent the following information.

@table @asis
@item Group Name
The name of the group.

@item Group Password
The encrypted group password. In practice, this field is never used. It is
usually empty, or set to @samp{*}.

@item Group ID Number
The numeric group-id number. This number should be unique within the file.

@item Group Member List
A comma-separated list of user names. These users are members of the group.
Most Unix systems allow users to be members of several groups
simultaneously. If your system does, then reading @file{/dev/user} will
return those group-id numbers in @code{$5} through @code{$NF}.
(Note that @file{/dev/user} is a @code{gawk} extension;
@pxref{Special Files, ,Special File Names in @code{gawk}}.)
@end table

Here is what running @code{grcat} might produce:

@example
@group
$ grcat
@print{} wheel:*:0:arnold
@print{} nogroup:*:65534:
@print{} daemon:*:1:
@print{} kmem:*:2:
@print{} staff:*:10:arnold,miriam,andy
@print{} other:*:20:
@dots{}
@end group
@end example

Here are the functions for obtaining information from the group database.
There are several, modeled after the C library functions of the same names.

@findex _gr_init
@example
@group
@c file eg/lib/groupawk.in
# group.awk --- functions for dealing with the group file
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993

BEGIN    \
@{
    # Change to suit your system
    _gr_awklib = "/usr/local/libexec/awk/"
@}
@c endfile
@end group

@group
@c file eg/lib/groupawk.in
function _gr_init(    oldfs, oldrs, olddol0, grcat, n, a, i)
@{
    if (_gr_inited)
        return
@end group

@group
    oldfs = FS
    oldrs = RS
    olddol0 = $0
    FS = ":"
    RS = "\n"
@end group

@group
    grcat = _gr_awklib "grcat"
    while ((grcat | getline) > 0) @{
        if ($1 in _gr_byname)
            _gr_byname[$1] = _gr_byname[$1] "," $4
        else
            _gr_byname[$1] = $0
        if ($3 in _gr_bygid)
            _gr_bygid[$3] = _gr_bygid[$3] "," $4
        else
            _gr_bygid[$3] = $0

        n = split($4, a, "[ \t]*,[ \t]*")
@end group
@group
        for (i = 1; i <= n; i++)
            if (a[i] in _gr_groupsbyuser)
                _gr_groupsbyuser[a[i]] = \
                    _gr_groupsbyuser[a[i]] " " $1
            else
                _gr_groupsbyuser[a[i]] = $1
@end group

@group
        _gr_bycount[++_gr_count] = $0
    @}
@end group
@group
    close(grcat)
    _gr_count = 0
    _gr_inited++
    FS = oldfs
    RS = oldrs
    $0 = olddol0
@}
@c endfile
@end group
@end example

The @code{BEGIN} rule sets a private variable to the directory where
@code{grcat} is stored. Since it is used to help out an @code{awk} library
routine, we have chosen to put it in @file{/usr/local/libexec/awk}. You might
want it to be in a different directory on your system.

These routines follow the same general outline as the user database routines
(@pxref{Passwd Functions, ,Reading the User Database}).
The @code{@w{_gr_inited}} variable is used to
ensure that the database is scanned no more than once.
The @code{@w{_gr_init}} function first saves @code{FS}, @code{RS}, and
@code{$0}, and then sets @code{FS} and @code{RS} to the correct values for
scanning the group information.

The group information is stored in several associative arrays.
The arrays are indexed by group name (@code{@w{_gr_byname}}), by group-id number
(@code{@w{_gr_bygid}}), and by position in the database (@code{@w{_gr_bycount}}).
There is an additional array indexed by user name (@code{@w{_gr_groupsbyuser}}),
that is a space-separated list of groups that each user belongs to.

Unlike the user database, it is possible to have multiple records in the
database for the same group. This is common when a group has a large number
of members. Such a pair of entries might look like:

@example
tvpeople:*:101:johny,jay,arsenio
tvpeople:*:101:david,conan,tom,joan
@end example

For this reason, @code{_gr_init} looks to see if a group name or
group-id number has already been seen. If it has, then the user names are
simply concatenated onto the previous list of users. (There is actually a
subtle problem with the code presented above. Suppose that
the first time there were no names. This code adds the names with
a leading comma. It also doesn't check that there is a @code{$4}.)
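
One possible way around the leading-comma problem is to route the
concatenation through a small helper. This is only a sketch, not the
code used in @file{group.awk}; the name @code{merge_members} is an
assumption for illustration.

```awk
# merge_members --- hypothetical helper, for illustration only.
# Append the member list new to the group record old, without
# producing a stray comma when either list is empty.
function merge_members(old, new)
{
    if (new == "")          # no fourth field in this record
        return old
    if (old ~ /:$/)         # earlier record had no members
        return old new
    return old "," new
}

BEGIN {
    print merge_members("tvpeople:*:101:johny,jay", "david,conan")
    print merge_members("tvpeople:*:101:", "david,conan")
    print merge_members("tvpeople:*:101:johny", "")
}
```

With such a helper, the assignments in @code{_gr_init} could be written
along the lines of
@code{_gr_byname[$1] = merge_members(_gr_byname[$1], $4)}.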

Finally, @code{_gr_init} closes the pipeline to @code{grcat}, restores
@code{FS}, @code{RS}, and @code{$0}, initializes @code{_gr_count} to zero
(it is used later), and makes @code{_gr_inited} non-zero.

@findex getgrnam
@example
@c @group
@c file eg/lib/groupawk.in
function getgrnam(group)
@{
    _gr_init()
    if (group in _gr_byname)
        return _gr_byname[group]
    return ""
@}
@c endfile
@c @end group
@end example

The @code{getgrnam} function takes a group name as its argument, and if that
group exists, it is returned. Otherwise, @code{getgrnam} returns the null
string.

@findex getgrgid
@example
@c @group
@c file eg/lib/groupawk.in
function getgrgid(gid)
@{
    _gr_init()
    if (gid in _gr_bygid)
        return _gr_bygid[gid]
    return ""
@}
@c endfile
@c @end group
@end example

The @code{getgrgid} function is similar; it takes a numeric group-id, and
looks up the information associated with that group-id.

@findex getgruser
@example
@group
@c file eg/lib/groupawk.in
function getgruser(user)
@{
    _gr_init()
    if (user in _gr_groupsbyuser)
        return _gr_groupsbyuser[user]
    return ""
@}
@c endfile
@end group
@end example

The @code{getgruser} function does not have a C counterpart. It takes a
user name, and returns the list of groups that have the user as a member.

@findex getgrent
@example
@c @group
@c file eg/lib/groupawk.in
function getgrent()
@{
    _gr_init()
    if (++_gr_count in _gr_bycount)
        return _gr_bycount[_gr_count]
    return ""
@}
@c endfile
@c @end group
@end example

The @code{getgrent} function steps through the database one entry at a time.
It uses @code{_gr_count} to track its position in the list.

@findex endgrent
@example
@group
@c file eg/lib/groupawk.in
function endgrent()
@{
    _gr_count = 0
@}
@c endfile
@end group
@end example

@code{endgrent} resets @code{_gr_count} to zero so that @code{getgrent} can
start over again.

As with the user database routines, each function calls @code{_gr_init} to
initialize the arrays. Doing so only incurs the extra overhead of running
@code{grcat} if these functions are used (as opposed to moving the body of
@code{_gr_init} into a @code{BEGIN} rule).

Most of the work is in scanning the database and building the various
associative arrays. The functions that the user calls are themselves very
simple, relying on @code{awk}'s associative arrays to do the work.

The @code{id} program in @ref{Id Program, ,Printing Out User Information},
uses these functions.

@node Library Names, , Group Functions, Library Functions
@section Naming Library Function Global Variables

@cindex namespace issues in @code{awk}
@cindex documenting @code{awk} programs
@cindex programs, documenting
Due to the way the @code{awk} language evolved, variables are either
@dfn{global} (usable by the entire program), or @dfn{local} (usable just by
a specific function). There is no intermediate state analogous to
@code{static} variables in C.

Library functions often need to have global variables that they can use to
preserve state information between calls to the function. For example,
@code{getopt}'s variable @code{_opti}
(@pxref{Getopt Function, ,Processing Command Line Options}),
and the @code{_tm_months} array used by @code{mktime}
(@pxref{Mktime Function, ,Turning Dates Into Timestamps}).
Such variables are called @dfn{private}, since the only functions that need to
use them are the ones in the library.

When writing a library function, you should try to choose names for your
private variables so that they will not conflict with any variables used by
either another library function or a user's main program. For example, a
name like @samp{i} or @samp{j} is not a good choice, since user programs
often use variable names like these for their own purposes.

The example programs shown in this chapter all start the names of their
private variables with an underscore (@samp{_}). Users generally don't use
leading underscores in their variable names, so this convention immediately
decreases the chances that the variable name will be accidentally shared
with the user's program.

In addition, several of the library functions use a prefix that helps
indicate what function or set of functions uses the variables. For example,
@code{_tm_months} in @code{mktime}
(@pxref{Mktime Function, ,Turning Dates Into Timestamps}), and
@code{_pw_byname} in the user database routines
(@pxref{Passwd Functions, ,Reading the User Database}).
This convention is recommended, since it even further decreases the chance
of inadvertent conflict among variable names.
Note that this convention can be used equally well for variable names
and for private function names.

While I could have re-written all the library routines to use this
convention, I did not do so, in order to show how my own @code{awk}
programming style has evolved, and to provide some basis for this
discussion.

As a final note on variable naming, if a function makes global variables
available for use by a main program, it is a good convention to start that
variable's name with a capital letter.
For example, @code{getopt} provides the @code{Opterr} and @code{Optind} variables
(@pxref{Getopt Function, ,Processing Command Line Options}).
The leading capital letter indicates that the variable is global, while the
fact that the name is not all capital letters indicates that the variable is
not one of @code{awk}'s built-in variables, like @code{FS}.

It is also important that @emph{all} variables in library functions
that do not need to save state are in fact declared local.  If this is
not done, the variable could accidentally be used in the user's program,
leading to bugs that are very difficult to track down.

@example
function lib_func(x, y,    l1, l2)
@{
    @dots{}
    @var{use variable} some_var  # some_var could be local
    @dots{}                   # but is not by oversight
@}
@end example

@cindex Tcl
A different convention, common in the Tcl community, is to use a single
associative array to hold the values needed by the library function(s), or
``package.''  This significantly decreases the number of actual global names
in use.  For example, the functions described in
@ref{Passwd Functions, , Reading the User Database},
might have used @code{@w{PW_data["inited"]}}, @code{@w{PW_data["awklib"]}},
@code{@w{PW_data["total"]}} and @code{@w{PW_data["count"]}}, instead of
@code{@w{_pw_inited}}, @code{@w{_pw_awklib}}, @code{@w{_pw_total}},
and @code{@w{_pw_count}}.

The conventions presented in this section are exactly that: conventions.  You
are not required to write your programs this way; we merely recommend that
you do so.

@node Sample Programs, Language History, Library Functions, Top
@chapter Practical @code{awk} Programs

This chapter presents a potpourri of @code{awk} programs for your reading
enjoyment.
@iftex
There are two sections.
The first presents @code{awk}
versions of several common POSIX utilities.
The second is a grab-bag of interesting programs.
@end iftex

Many of these programs use the library functions presented in
@ref{Library Functions, ,A Library of @code{awk} Functions}.

@menu
* Clones::                      Clones of common utilities.
* Miscellaneous Programs::      Some interesting @code{awk} programs.
@end menu

@node Clones, Miscellaneous Programs, Sample Programs, Sample Programs
@section Re-inventing Wheels for Fun and Profit

This section presents a number of POSIX utilities that are implemented in
@code{awk}.  Re-inventing these programs in @code{awk} is often enjoyable,
since the algorithms can be very clearly expressed, and usually the code is
very concise and simple.  This is true because @code{awk} does so much for you.

These programs are not necessarily intended to
replace the installed versions on your system.  Instead, their
purpose is to illustrate @code{awk} language programming for ``real world''
tasks.

The programs are presented in alphabetical order.

@menu
* Cut Program::                 The @code{cut} utility.
* Egrep Program::               The @code{egrep} utility.
* Id Program::                  The @code{id} utility.
* Split Program::               The @code{split} utility.
* Tee Program::                 The @code{tee} utility.
* Uniq Program::                The @code{uniq} utility.
* Wc Program::                  The @code{wc} utility.
@end menu

@node Cut Program, Egrep Program, Clones, Clones
@subsection Cutting Out Fields and Columns

@cindex @code{cut} utility
The @code{cut} utility selects, or ``cuts,'' either characters or fields
from its standard
input and sends them to its standard output.  What to cut is given as either
a list of characters, or a list of fields.
By default, fields are separated
by tabs, but you may supply a command line option to change the field
@dfn{delimiter}, i.e.@: the field separator character.  @code{cut}'s definition
of fields is less general than @code{awk}'s.

A common use of @code{cut} might be to pull out just the login names of
logged-on users from the output of @code{who}.  For example, the following
pipeline generates a sorted, unique list of the logged-on users:

@example
who | cut -c1-8 | sort | uniq
@end example

The options for @code{cut} are:

@table @code
@item -c @var{list}
Use @var{list} as the list of characters to cut out.  Items within the list
may be separated by commas, and ranges of characters can be separated with
dashes.  The list @samp{1-8,15,22-35} specifies characters one through
eight, 15, and 22 through 35.

@item -f @var{list}
Use @var{list} as the list of fields to cut out.

@item -d @var{delim}
Use @var{delim} as the field separator character instead of the tab
character.

@item -s
Suppress printing of lines that do not contain the field delimiter.
@end table

The @code{awk} implementation of @code{cut} uses the @code{getopt} library
function (@pxref{Getopt Function, ,Processing Command Line Options}),
and the @code{join} library function
(@pxref{Join Function, ,Merging an Array Into a String}).

The program begins with a comment describing the options and a @code{usage}
function which prints out a usage message and exits.  @code{usage} is called
if invalid arguments are supplied.

@findex cut.awk
@example
@c @group
@c file eg/prog/cut.awk
# cut.awk --- implement cut in awk
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993

# Options:
#    -f list        Cut fields
#    -d c           Field delimiter character
#    -c list        Cut characters
#
#    -s        Suppress lines without the delimiter character

function usage(    e1, e2)
@{
    e1 = "usage: cut [-f list] [-d c] [-s] [files...]"
    e2 = "usage: cut [-c list]          [files...]"
    print e1 > "/dev/stderr"
    print e2 > "/dev/stderr"
    exit 1
@}
@c endfile
@c @end group
@end example

@noindent
The variables @code{e1} and @code{e2} are used so that the function
fits nicely on the
@iftex
page.
@end iftex
@ifinfo
screen.
@end ifinfo

Next comes a @code{BEGIN} rule that parses the command line options.
It sets @code{FS} to a single tab character, since that is @code{cut}'s
default field separator.  The output field separator is also set to be the
same as the input field separator.  Then @code{getopt} is used to step
through the command line options.  One or the other of the variables
@code{by_fields} or @code{by_chars} is set to true, to indicate that
processing should be done by fields or by characters respectively.
When cutting by characters, the output field separator is set to the null
string.

@example
@c @group
@c file eg/prog/cut.awk
BEGIN    \
@{
    FS = "\t"    # default
    OFS = FS
    while ((c = getopt(ARGC, ARGV, "sf:c:d:")) != -1) @{
        if (c == "f") @{
            by_fields = 1
            fieldlist = Optarg
        @} else if (c == "c") @{
            by_chars = 1
            fieldlist = Optarg
            OFS = ""
@group
        @} else if (c == "d") @{
            if (length(Optarg) > 1) @{
                printf("Using first character of %s" \
                " for delimiter\n", Optarg) > "/dev/stderr"
                Optarg = substr(Optarg, 1, 1)
            @}
            FS = Optarg
            OFS = FS
            if (FS == " ")    # defeat awk semantics
                FS = "[ ]"
        @} else if (c == "s")
            suppress++
        else
            usage()
    @}
@end group

    for (i = 1; i < Optind; i++)
        ARGV[i] = ""
@c endfile
@c @end group
@end example

Special care is taken when the field delimiter is a space.  Using
@code{@w{" "}} (a single space) for the value of @code{FS} is
incorrect, since @code{awk} would then
separate fields with runs of spaces, tabs and/or newlines, and we want them to be
separated with individual spaces.  Also, note that after @code{getopt} is
through, we have to clear out all the elements of @code{ARGV} from one up
to, but not including, @code{Optind}, so that @code{awk} will not try to
process the command line options as file names.

After dealing with the command line options, the program verifies that the
options make sense.  Only one or the other of @samp{-c} and @samp{-f} should
be used, and both require a field list.  Then either @code{set_fieldlist} or
@code{set_charlist} is called to pull apart the list of fields or
characters.

@example
@c @group
@c file eg/prog/cut.awk
    if (by_fields && by_chars)
        usage()

    if (by_fields == 0 && by_chars == 0)
        by_fields = 1    # default

    if (fieldlist == "") @{
        print "cut: needs list for -c or -f" > "/dev/stderr"
        exit 1
    @}

@group
    if (by_fields)
        set_fieldlist()
    else
        set_charlist()
@}
@c endfile
@end group
@end example

Here is @code{set_fieldlist}.  It first splits the field list apart
at the commas, into an array.  Then, for each element of the array, it
looks to see if it is actually a range, and if so splits it apart.  The range
is verified to make sure the first number is smaller than the second.
Each number in the list is added to the @code{flist} array, which simply
lists the fields that will be printed.
Normal field splitting is used; that is, the program lets @code{awk}
handle the job of splitting the records into fields.

@example
@c @group
@c file eg/prog/cut.awk
function set_fieldlist(        n, m, i, j, k, f, g)
@{
    n = split(fieldlist, f, ",")
    j = 1    # index in flist
    for (i = 1; i <= n; i++) @{
        if (index(f[i], "-") != 0) @{ # a range
            m = split(f[i], g, "-")
            if (m != 2 || g[1] >= g[2]) @{
                printf("bad field list: %s\n",
                                  f[i]) > "/dev/stderr"
                exit 1
            @}
            for (k = g[1]; k <= g[2]; k++)
                flist[j++] = k
        @} else
            flist[j++] = f[i]
    @}
    nfields = j - 1
@}
@c endfile
@c @end group
@end example

The @code{set_charlist} function is more complicated than @code{set_fieldlist}.
The idea here is to use @code{gawk}'s @code{FIELDWIDTHS} variable
(@pxref{Constant Size, ,Reading Fixed-width Data}),
which describes constant width input.  When using a character list, that is
exactly what we have.
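
As a minimal sketch of the idea (the sample record and the widths below are
made up for illustration and are not part of @file{cut.awk}), the following
command asks @code{gawk} to treat each record as three fixed-width fields of
three, two, and four characters, and prints the first and third of them.
This is what cutting characters one through three and six through nine
amounts to:

@example
$ echo abcdefghi |
> gawk 'BEGIN @{ FIELDWIDTHS = "3 2 4" @} @{ print $1, $3 @}'
@print{} abc fghi
@end example

@noindent
The two-character field in the middle is never printed; it only serves to
skip over the characters that were not requested.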

Setting up @code{FIELDWIDTHS} is more complicated than simply listing the
fields that need to be printed.  We have to keep track of the fields to be
printed, and also the intervening characters that have to be skipped.
For example, suppose you wanted characters one through eight, 15, and
22 through 35.  You would use @samp{-c 1-8,15,22-35}.  The necessary value
for @code{FIELDWIDTHS} would be @code{@w{"8 6 1 6 14"}}.  This gives us five
fields, and what should be printed are @code{$1}, @code{$3}, and @code{$5}.
The intermediate fields are ``filler,'' the characters in between the
desired data.

@code{flist} lists the fields to be printed, and @code{t} tracks the
complete field list, including filler fields.

@example
@c @group
@c file eg/prog/cut.awk
function set_charlist(    field, i, j, f, g, t, n, m,
                          filler, last, len)
@{
    field = 1   # count total fields
    n = split(fieldlist, f, ",")
    j = 1       # index in flist
    for (i = 1; i <= n; i++) @{
        if (index(f[i], "-") != 0) @{ # range
            m = split(f[i], g, "-")
            if (m != 2 || g[1] >= g[2]) @{
                printf("bad character list: %s\n",
                               f[i]) > "/dev/stderr"
                exit 1
            @}
            len = g[2] - g[1] + 1
            if (g[1] > 1)  # compute length of filler
                filler = g[1] - last - 1
            else
                filler = 0
            if (filler)
                t[field++] = filler
            t[field++] = len  # length of field
            last = g[2]
            flist[j++] = field - 1
        @} else @{
            if (f[i] > 1)
                filler = f[i] - last - 1
            else
                filler = 0
            if (filler)
                t[field++] = filler
            t[field++] = 1
            last = f[i]
            flist[j++] = field - 1
        @}
    @}
@group
    FIELDWIDTHS = join(t, 1, field - 1)
    nfields = j - 1
@}
@end group
@c endfile
@end example

Here is the rule that actually processes the data.
If the @samp{-s} option
was given, then @code{suppress} will be true.  The first @code{if} statement
checks whether the input record contains the field separator.  If
@code{cut} is processing fields, @code{suppress} is true, and the field
separator character is not in the record, then the record is skipped.

If the record is valid, then at this point, @code{gawk} has split the data
into fields, either using the character in @code{FS} or using fixed-length
fields and @code{FIELDWIDTHS}.  The loop goes through the list of fields
that should be printed.  If the corresponding field has data in it, it is
printed.  If the next field also has data, then the separator character is
written out in between the fields.

@c 2e: Could use `index($0, FS) != 0' instead of `$0 !~ FS', below

@example
@c @group
@c file eg/prog/cut.awk
@{
    if (by_fields && suppress && $0 !~ FS)
        next

    for (i = 1; i <= nfields; i++) @{
        if ($flist[i] != "") @{
            printf "%s", $flist[i]
            if (i < nfields && $flist[i+1] != "")
                printf "%s", OFS
        @}
    @}
    print ""
@}
@c endfile
@c @end group
@end example

This version of @code{cut} relies on @code{gawk}'s @code{FIELDWIDTHS}
variable to do the character-based cutting.  While it would be possible in
other @code{awk} implementations to use @code{substr}
(@pxref{String Functions, ,Built-in Functions for String Manipulation}),
it would also be extremely painful to do so.
The @code{FIELDWIDTHS} variable supplies an elegant solution to the problem
of picking the input line apart by characters.

@node Egrep Program, Id Program, Cut Program, Clones
@subsection Searching for Regular Expressions in Files

@cindex @code{egrep} utility
The @code{egrep} utility searches files for patterns.
It uses regular
expressions that are almost identical to those available in @code{awk}
(@pxref{Regexp Constants, ,Regular Expression Constants}).  It is used this way:

@example
egrep @r{[} @var{options} @r{]} '@var{pattern}' @var{files} @dots{}
@end example

The @var{pattern} is a regexp.
In typical usage, the regexp is quoted to prevent the shell from expanding
any of the special characters as file name wildcards.
Normally, @code{egrep} prints the
lines that matched.  If multiple file names are provided on the command
line, each output line is preceded by the name of the file and a colon.

The options are:

@table @code
@item -c
Print out a count of the lines that matched the pattern, instead of the
lines themselves.

@item -s
Be silent.  No output is produced, and the exit value indicates whether
or not the pattern was matched.

@item -v
Invert the sense of the test.  @code{egrep} prints the lines that do
@emph{not} match the pattern, and exits successfully if the pattern was not
matched.

@item -i
Ignore case distinctions in both the pattern and the input data.

@item -l
Only print the names of the files that matched, not the lines that matched.

@item -e @var{pattern}
Use @var{pattern} as the regexp to match.  The purpose of the @samp{-e}
option is to allow patterns that start with a @samp{-}.
@end table

This version uses the @code{getopt} library function
(@pxref{Getopt Function, ,Processing Command Line Options}),
and the file transition library program
(@pxref{Filetrans Function, ,Noting Data File Boundaries}).

The program begins with a descriptive comment, and then a @code{BEGIN} rule
that processes the command line arguments with @code{getopt}.
The @samp{-i}
(ignore case) option is particularly easy with @code{gawk}; we just use the
@code{IGNORECASE} built-in variable
(@pxref{Built-in Variables}).

@findex egrep.awk
@example
@c @group
@c file eg/prog/egrep.awk
# egrep.awk --- simulate egrep in awk
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993

# Options:
#    -c    count of lines
#    -s    silent - use exit value
#    -v    invert test, success if no match
#    -i    ignore case
#    -l    print filenames only
#    -e    argument is pattern

BEGIN @{
    while ((c = getopt(ARGC, ARGV, "ce:svil")) != -1) @{
        if (c == "c")
            count_only++
        else if (c == "s")
            no_print++
        else if (c == "v")
            invert++
        else if (c == "i")
            IGNORECASE = 1
        else if (c == "l")
            filenames_only++
        else if (c == "e")
            pattern = Optarg
        else
            usage()
    @}
@c endfile
@c @end group
@end example

Next comes the code that handles the @code{egrep}-specific behavior.  If no
pattern was supplied with @samp{-e}, the first non-option on the command
line is used.  The @code{awk} command line arguments up to, but not
including, @code{ARGV[Optind]}
are cleared, so that @code{awk} won't try to process them as files.  If no
files were specified, the standard input is used, and if multiple files were
specified, we make sure to note this so that the file names can precede the
matched lines in the output.

The last two lines are commented out, since they are not needed in
@code{gawk}.  They should be uncommented if you have to use another version
of @code{awk}.

@example
@c @group
@c file eg/prog/egrep.awk
    if (pattern == "")
        pattern = ARGV[Optind++]

    for (i = 1; i < Optind; i++)
        ARGV[i] = ""
    if (Optind >= ARGC) @{
        ARGV[1] = "-"
        ARGC = 2
    @} else if (ARGC - Optind > 1)
        do_filenames++

#    if (IGNORECASE)
#        pattern = tolower(pattern)
@}
@c endfile
@c @end group
@end example

The next set of lines should be uncommented if you are not using
@code{gawk}.  This rule translates all the characters in the input line
into lower-case if the @samp{-i} option was specified.  The rule is
commented out since it is not necessary with @code{gawk}.
@c bug: if a match happens, we output the translated line, not the original

@example
@c @group
@c file eg/prog/egrep.awk
#@{
#    if (IGNORECASE)
#        $0 = tolower($0)
#@}
@c endfile
@c @end group
@end example

The @code{beginfile} function is called by the rule in @file{ftrans.awk}
when each new file is processed.  In this case, it is very simple; all it
does is initialize a variable @code{fcount} to zero.  @code{fcount} tracks
how many lines in the current file matched the pattern.

@example
@group
@c file eg/prog/egrep.awk
function beginfile(junk)
@{
    fcount = 0
@}
@c endfile
@end group
@end example

The @code{endfile} function is called after each file has been processed.
It is used only when the user wants a count of the number of lines that
matched.  @code{no_print} will be true only if the exit status is desired.
@code{count_only} will be true if line counts are desired.  @code{egrep}
will therefore only print line counts if printing and counting are enabled.
The output format must be adjusted depending upon the number of files to be
processed.
Finally, @code{fcount} is added to @code{total}, so that we
know how many lines altogether matched the pattern.

@example
@group
@c file eg/prog/egrep.awk
function endfile(file)
@{
    if (! no_print && count_only)
        if (do_filenames)
            print file ":" fcount
        else
            print fcount

    total += fcount
@}
@c endfile
@end group
@end example

This rule does most of the work of matching lines.  The variable
@code{matches} will be true if the line matched the pattern.  If the user
wants lines that did not match, the sense of @code{matches} is inverted
using the @samp{!} operator.  @code{fcount} is incremented with the value of
@code{matches}, which will be either one or zero, depending upon a
successful or unsuccessful match.  If the line did not match, the
@code{next} statement just moves on to the next record.

There are several optimizations for performance in the following few lines
of code.  If the user only wants exit status (@code{no_print} is true), and
we don't have to count lines, then it is enough to know that one line in
this file matched, and we can skip on to the next file with @code{nextfile}.
Along similar lines, if we are only printing file names, and we
don't need to count lines, we can print the file name, and then skip to the
next file with @code{nextfile}.

Finally, each line is printed, with a leading filename and colon if
necessary.

@ignore
2e: note, probably better to recode the last few lines as
    if (!
count_only) @{
        if (no_print)
            nextfile

        if (filenames_only) @{
            print FILENAME
            nextfile
        @}

        if (do_filenames)
            print FILENAME ":" $0
        else
            print
    @}
@end ignore

@example
@c @group
@c file eg/prog/egrep.awk
@{
    matches = ($0 ~ pattern)
    if (invert)
        matches = ! matches

    fcount += matches    # 1 or 0

    if (! matches)
        next

    if (no_print && ! count_only)
        nextfile

    if (filenames_only && ! count_only) @{
        print FILENAME
        nextfile
    @}

    if (do_filenames && ! count_only)
        print FILENAME ":" $0
@group
    else if (! count_only)
        print
@end group
@}
@c endfile
@c @end group
@end example

@c @strong{Exercise}: rearrange the code inside @samp{if (! count_only)}.

The @code{END} rule takes care of producing the correct exit status.  If
there were no matches, the exit status is one; otherwise it is zero.

@example
@c @group
@c file eg/prog/egrep.awk
END    \
@{
    if (total == 0)
        exit 1
    exit 0
@}
@c endfile
@c @end group
@end example

The @code{usage} function prints a usage message in case of invalid options
and then exits.

@example
@c @group
@c file eg/prog/egrep.awk
function usage(    e)
@{
    e = "Usage: egrep [-csvil] [-e pat] [files ...]"
    print e > "/dev/stderr"
    exit 1
@}
@c endfile
@c @end group
@end example

The variable @code{e} is used so that the function fits nicely
on the printed page.

@cindex backslash continuation
A note on programming style: you may have noticed that the @code{END}
rule uses backslash continuation, with the open brace on a line by
itself.
This is so that the rule more closely resembles the way functions
are written.  Many of the examples
@iftex
in this chapter
@end iftex
use this style.  You can decide for yourself whether you like writing
your @code{BEGIN} and @code{END} rules this way.

@node Id Program, Split Program, Egrep Program, Clones
@subsection Printing Out User Information

@cindex @code{id} utility
The @code{id} utility lists a user's real and effective user-id numbers,
real and effective group-id numbers, and the user's group set, if any.
@code{id} will only print the effective user-id and group-id if they are
different from the real ones.  If possible, @code{id} will also supply the
corresponding user and group names.  The output might look like this:

@example
$ id
@print{} uid=2076(arnold) gid=10(staff) groups=10(staff),4(tty)
@end example

This information is exactly what is provided by @code{gawk}'s
@file{/dev/user} special file (@pxref{Special Files, ,Special File Names in @code{gawk}}).
However, the @code{id} utility provides a more palatable output than just a
string of numbers.

Here is a simple version of @code{id} written in @code{awk}.
It uses the user database library functions
(@pxref{Passwd Functions, ,Reading the User Database}),
and the group database library functions
(@pxref{Group Functions, ,Reading the Group Database}).

The program is fairly straightforward.  All the work is done in the
@code{BEGIN} rule.  The user and group id numbers are obtained from
@file{/dev/user}.  If there is no support for @file{/dev/user}, the program
gives up.

The code is repetitive.  The entry in the user database for the real user-id
number is split into parts at the @samp{:}.  The name is the first field.
Similar code is used for the effective user-id number, and the group
numbers.
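
A minimal sketch of that splitting step, using a made-up user database entry
(the record below is purely illustrative; in @file{id.awk} the entry comes
from @code{getpwuid}, and the numeric id comes from @file{/dev/user} rather
than from the entry itself):

@example
$ gawk 'BEGIN @{
>     pw = "arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh"
>     split(pw, a, ":")          # a[1] is the login name
>     printf("uid=%d(%s)\n", a[3], a[1])
> @}'
@print{} uid=2076(arnold)
@end example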

@findex id.awk
@example
@c @group
@c file eg/prog/id.awk
# id.awk --- implement id in awk
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993

# output is:
# uid=12(foo) euid=34(bar) gid=3(baz) \
#             egid=5(blat) groups=9(nine),2(two),1(one)

BEGIN    \
@{
    if ((getline < "/dev/user") < 0) @{
        err = "id: no /dev/user support - cannot run"
        print err > "/dev/stderr"
        exit 1
    @}
    close("/dev/user")

    uid = $1
    euid = $2
    gid = $3
    egid = $4

    printf("uid=%d", uid)
    pw = getpwuid(uid)
@group
    if (pw != "") @{
        split(pw, a, ":")
        printf("(%s)", a[1])
    @}
@end group

    if (euid != uid) @{
        printf(" euid=%d", euid)
        pw = getpwuid(euid)
        if (pw != "") @{
            split(pw, a, ":")
            printf("(%s)", a[1])
        @}
    @}

    printf(" gid=%d", gid)
    pw = getgrgid(gid)
    if (pw != "") @{
        split(pw, a, ":")
        printf("(%s)", a[1])
    @}

    if (egid != gid) @{
        printf(" egid=%d", egid)
        pw = getgrgid(egid)
        if (pw != "") @{
            split(pw, a, ":")
            printf("(%s)", a[1])
        @}
    @}

    if (NF > 4) @{
        printf(" groups=");
        for (i = 5; i <= NF; i++) @{
            printf("%d", $i)
            pw = getgrgid($i)
            if (pw != "") @{
                split(pw, a, ":")
                printf("(%s)", a[1])
            @}
@group
            if (i < NF)
                printf(",")
@end group
        @}
    @}
    print ""
@}
@c endfile
@c @end group
@end example

@c exercise!!!
@ignore
The POSIX version of @code{id} takes arguments that control which
information is printed.  Modify this version to accept the same
arguments and perform in the same way.
@end ignore

@node Split Program, Tee Program, Id Program, Clones
@subsection Splitting a Large File Into Pieces

@cindex @code{split} utility
The @code{split} program splits large text files into smaller pieces.  By default,
the output files are named @file{xaa}, @file{xab}, and so on.  Each file has
1000 lines in it, with the likely exception of the last file.  To change the
number of lines in each file, you supply a number on the command line
preceded with a minus, e.g., @samp{-500} for files with 500 lines in them
instead of 1000.  To change the name of the output files to something like
@file{myfileaa}, @file{myfileab}, and so on, you supply an additional
argument that specifies the filename prefix.

Here is a version of @code{split} in @code{awk}.  It uses the @code{ord} and
@code{chr} functions presented in
@ref{Ordinal Functions, ,Translating Between Characters and Numbers}.

The program first sets its defaults, and then tests to make sure there are
not too many arguments.  It then looks at each argument in turn.  The
first argument could be a minus followed by a number.  If it is, this happens
to look like a negative number, so it is made positive, and that is the
count of lines.  The data file name is skipped over, and the final argument
is used as the prefix for the output file names.
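
The ``made positive'' step relies on @code{awk} converting the string to a
number when the unary minus operator is applied.  A minimal sketch, with the
argument hard-coded in a hypothetical variable @code{arg} rather than taken
from @code{ARGV}:

@example
$ gawk 'BEGIN @{
>     arg = "-500"
>     if (arg ~ /^-[0-9]+$/)
>         print -arg           # negating -500 yields 500
> @}'
@print{} 500
@end example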

@findex split.awk
@example
@c @group
@c file eg/prog/split.awk
# split.awk --- do split in awk
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993

# usage: split [-num] [file] [outname]

BEGIN @{
    outfile = "x"    # default
    count = 1000
    if (ARGC > 4)
        usage()

    i = 1
    if (ARGV[i] ~ /^-[0-9]+$/) @{
        count = -ARGV[i]
        ARGV[i] = ""
        i++
    @}
    # test argv in case reading from stdin instead of file
    if (i in ARGV)
        i++    # skip data file name
    if (i in ARGV) @{
        outfile = ARGV[i]
        ARGV[i] = ""
    @}

    s1 = s2 = "a"
    out = (outfile s1 s2)
@}
@c endfile
@c @end group
@end example

The next rule does most of the work.  @code{tcount} (temporary count) tracks
how many lines have been printed to the output file so far.  If it is greater
than @code{count}, it is time to close the current file and start a new one.
@code{s1} and @code{s2} track the current suffixes for the file name.
As long as @code{s2} has not reached @samp{z}, it simply moves to the next
letter in the alphabet.  When @code{s2} is @samp{z}, @code{s1}
moves to the next letter and @code{s2} starts over again at
@samp{a}.  If they are both @samp{z}, the file is just too big.

@example
@c @group
@c file eg/prog/split.awk
@{
    if (++tcount > count) @{
        close(out)
        if (s2 == "z") @{
            if (s1 == "z") @{
                printf("split: %s is too large to split\n", \
                       FILENAME) > "/dev/stderr"
                exit 1
            @}
            s1 = chr(ord(s1) + 1)
            s2 = "a"
        @} else
            s2 = chr(ord(s2) + 1)
        out = (outfile s1 s2)
        tcount = 1
    @}
    print > out
@}
@c endfile
@c @end group
@end example

The @code{usage} function simply prints an error message and exits.

@example
@c @group
@c file eg/prog/split.awk
function usage(   e)
@{
    e = "usage: split [-num] [file] [outname]"
    print e > "/dev/stderr"
    exit 1
@}
@c endfile
@c @end group
@end example

@noindent
The variable @code{e} is used so that the function
fits nicely on the
@iftex
page.
@end iftex
@ifinfo
screen.
@end ifinfo

This program is a bit sloppy; it relies on @code{awk} to close the last file
for it automatically, instead of doing it in an @code{END} rule.

@node Tee Program, Uniq Program, Split Program, Clones
@subsection Duplicating Output Into Multiple Files

@cindex @code{tee} utility
The @code{tee} program is known as a ``pipe fitting.''  @code{tee} copies
its standard input to its standard output, and also duplicates it to the
files named on the command line.  Its usage is:

@example
tee @r{[}-a@r{]} file @dots{}
@end example

The @samp{-a} option tells @code{tee} to append to the named files, instead of
truncating them and starting over.

The @code{BEGIN} rule first makes a copy of all the command line arguments,
into an array named @code{copy}.
@code{ARGV[0]} is not copied, since it is not needed.
@code{tee} cannot use @code{ARGV} directly, since @code{awk} will attempt to
process each file named in @code{ARGV} as input data.

If the first argument is @samp{-a}, then the flag variable
@code{append} is set to true, and both @code{ARGV[1]} and
@code{copy[1]} are deleted.  If @code{ARGC} is less than two, then no file
names were supplied, and @code{tee} prints a usage message and exits.
Finally, @code{awk} is forced to read the standard input by setting
@code{ARGV[1]} to @code{"-"}, and @code{ARGC} to two.

@c 2e: the `ARGC--' in the `if (ARGV[1] == "-a")' isn't needed.

@findex tee.awk
@example
@group
@c file eg/prog/tee.awk
# tee.awk --- tee in awk
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993
# Revised December 1995
@end group

@group
BEGIN    \
@{
    for (i = 1; i < ARGC; i++)
        copy[i] = ARGV[i]
@end group

@group
    if (ARGV[1] == "-a") @{
        append = 1
        delete ARGV[1]
        delete copy[1]
        ARGC--
    @}
@end group
@group
    if (ARGC < 2) @{
        print "usage: tee [-a] file ..." > "/dev/stderr"
        exit 1
    @}
@end group
@group
    ARGV[1] = "-"
    ARGC = 2
@}
@c endfile
@end group
@end example

The single rule does all the work. Since there is no pattern, it is
executed for each line of input. The body of the rule simply prints the
line into each file on the command line, and then to the standard output.

@example
@group
@c file eg/prog/tee.awk
@{
    # moving the if outside the loop makes it run faster
    if (append)
        for (i in copy)
            print >> copy[i]
    else
        for (i in copy)
            print > copy[i]
    print
@}
@c endfile
@end group
@end example

It would have been possible to code the loop this way:

@example
for (i in copy)
    if (append)
        print >> copy[i]
    else
        print > copy[i]
@end example

@noindent
This is more concise, but it is also less efficient. The @samp{if} is
tested for each record and for each output file. By duplicating the loop
body, the @samp{if} is only tested once for each input record. If there are
@var{N} input records and @var{M} output files, the first method only
executes @var{N} @samp{if} statements, while the second would execute
@var{N}@code{*}@var{M} @samp{if} statements.

Finally, the @code{END} rule cleans up, by closing all the output files.

@example
@c @group
@c file eg/prog/tee.awk
END    \
@{
    for (i in copy)
        close(copy[i])
@}
@c endfile
@c @end group
@end example

@node Uniq Program, Wc Program, Tee Program, Clones
@subsection Printing Non-duplicated Lines of Text

@cindex @code{uniq} utility
The @code{uniq} utility reads sorted lines of data on its standard input,
and (by default) removes duplicate lines. In other words, only unique lines
are printed, hence the name. @code{uniq} has a number of options. The usage is:

@example
uniq @r{[}-udc @r{[}-@var{n}@r{]]} @r{[}+@var{n}@r{]} @r{[} @var{input file} @r{[} @var{output file} @r{]]}
@end example

The option meanings are:

@table @code
@item -d
Only print repeated lines.

@item -u
Only print non-repeated lines.

@item -c
Count lines. This option overrides @samp{-d} and @samp{-u}. Both repeated
and non-repeated lines are counted.

@item -@var{n}
Skip @var{n} fields before comparing lines. The definition of fields
is similar to @code{awk}'s default: non-whitespace characters separated
by runs of spaces and/or tabs.

@item +@var{n}
Skip @var{n} characters before comparing lines. Any fields specified with
@samp{-@var{n}} are skipped first.

@item @var{input file}
Data is read from the input file named on the command line, instead of from
the standard input.

@item @var{output file}
The generated output is sent to the named output file, instead of to the
standard output.
@end table

Normally @code{uniq} behaves as if both the @samp{-d} and @samp{-u} options
had been provided.

Here is an @code{awk} implementation of @code{uniq}.
It uses the
@code{getopt} library function
(@pxref{Getopt Function, ,Processing Command Line Options}),
and the @code{join} library function
(@pxref{Join Function, ,Merging an Array Into a String}).

The program begins with a @code{usage} function and then a brief outline of
the options and their meanings in a comment.

The @code{BEGIN} rule deals with the command line arguments and options. It
uses a trick to get @code{getopt} to handle options of the form @samp{-25},
treating such an option as the option letter @samp{2} with an argument of
@samp{5}. If indeed two or more digits were supplied (@code{Optarg} looks
like a number), @code{Optarg} is
concatenated with the option digit, and then the result is added to zero to make
it into a number. If there is only one digit in the option, then
@code{Optarg} is not needed, and @code{Optind} must be decremented so that
@code{getopt} will process it next time. This code is admittedly a bit
tricky.

If no options were supplied, then the default is taken, to print both
repeated and non-repeated lines. The output file, if provided, is assigned
to @code{outputfile}. Earlier, @code{outputfile} was initialized to the
standard output, @file{/dev/stdout}.

@findex uniq.awk
@example
@c file eg/prog/uniq.awk
# uniq.awk --- do uniq in awk
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993

@group
function usage(    e)
@{
    e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
    print e > "/dev/stderr"
    exit 1
@}
@end group

# -c    count lines.
#       overrides -d and -u
# -d    only repeated lines
# -u    only non-repeated lines
# -n    skip n fields
# +n    skip n characters, skip fields first

BEGIN    \
@{
    count = 1
    outputfile = "/dev/stdout"
    opts = "udc0:1:2:3:4:5:6:7:8:9:"
    while ((c = getopt(ARGC, ARGV, opts)) != -1) @{
        if (c == "u")
            non_repeated_only++
        else if (c == "d")
            repeated_only++
        else if (c == "c")
            do_count++
        else if (index("0123456789", c) != 0) @{
            # getopt requires args to options
            # this messes us up for things like -5
            if (Optarg ~ /^[0-9]+$/)
                fcount = (c Optarg) + 0
@group
            else @{
                fcount = c + 0
                Optind--
            @}
@end group
        @} else
            usage()
    @}

    if (ARGV[Optind] ~ /^\+[0-9]+$/) @{
        charcount = substr(ARGV[Optind], 2) + 0
        Optind++
    @}

    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    if (repeated_only == 0 && non_repeated_only == 0)
        repeated_only = non_repeated_only = 1

    if (ARGC - Optind == 2) @{
        outputfile = ARGV[ARGC - 1]
        ARGV[ARGC - 1] = ""
    @}
@}
@c endfile
@end example

The following function, @code{are_equal}, compares the current line,
@code{$0}, to the
previous line, @code{last}. It handles skipping fields and characters.

If no field count and no character count were specified, @code{are_equal}
simply returns one or zero depending upon the result of a simple string
comparison of @code{last} and @code{$0}. Otherwise, things get more
complicated.

If fields have to be skipped, each line is broken into an array using
@code{split}
(@pxref{String Functions, ,Built-in Functions for String Manipulation}),
and then the desired fields are joined back into a line using @code{join}.
The joined lines are stored in @code{clast} and @code{cline}.
If no fields are skipped, @code{clast} and @code{cline} are set to
@code{last} and @code{$0} respectively.

Finally, if characters are skipped, @code{substr} is used to strip off the
leading @code{charcount} characters in @code{clast} and @code{cline}. The
two strings are then compared, and @code{are_equal} returns the result.

@example
@c @group
@c file eg/prog/uniq.awk
function are_equal(    n, m, clast, cline, alast, aline)
@{
    if (fcount == 0 && charcount == 0)
        return (last == $0)

    if (fcount > 0) @{
        n = split(last, alast)
        m = split($0, aline)
        clast = join(alast, fcount+1, n)
        cline = join(aline, fcount+1, m)
    @} else @{
        clast = last
        cline = $0
    @}
    if (charcount) @{
        clast = substr(clast, charcount + 1)
        cline = substr(cline, charcount + 1)
    @}

    return (clast == cline)
@}
@c endfile
@c @end group
@end example

The following two rules are the body of the program. The first one is
executed only for the very first line of data. It sets @code{last} equal to
@code{$0}, so that subsequent lines of text have something to be compared to.

The second rule does the work. The variable @code{equal} will be one or zero
depending upon the results of @code{are_equal}'s comparison. If @code{uniq}
is counting lines (@samp{-c}), then the @code{count} variable is incremented if
the lines are equal. Otherwise the line is printed and @code{count} is
reset, since the two lines are not equal.

If @code{uniq} is not counting, @code{count} is incremented if the lines are
equal. Otherwise, if @code{uniq} is printing repeated lines and more than
one copy of the line has been seen, or if @code{uniq} is printing non-repeated
lines and only one copy has been seen, then the line is printed, and @code{count}
is reset.

Finally, similar logic is used in the @code{END} rule to print the final
line of input data.

@example
@c @group
@c file eg/prog/uniq.awk
@group
NR == 1 @{
    last = $0
    next
@}
@end group

@{
    equal = are_equal()

    if (do_count) @{    # overrides -d and -u
        if (equal)
            count++
        else @{
            printf("%4d %s\n", count, last) > outputfile
            last = $0
            count = 1    # reset
        @}
        next
    @}

    if (equal)
        count++
    else @{
        if ((repeated_only && count > 1) ||
            (non_repeated_only && count == 1))
                print last > outputfile
        last = $0
        count = 1
    @}
@}

@group
END @{
    if (do_count)
        printf("%4d %s\n", count, last) > outputfile
    else if ((repeated_only && count > 1) ||
            (non_repeated_only && count == 1))
        print last > outputfile
@}
@end group
@c endfile
@c @end group
@end example

@node Wc Program, , Uniq Program, Clones
@subsection Counting Things

@cindex @code{wc} utility
The @code{wc} (word count) utility counts lines, words, and characters in
one or more input files. Its usage is:

@example
wc @r{[}-lwc@r{]} @r{[} @var{files} @dots{} @r{]}
@end example

If no files are specified on the command line, @code{wc} reads its standard
input. If there are multiple files, it will also print total counts for all
the files. The options and their meanings are:

@table @code
@item -l
Only count lines.

@item -w
Only count words.
A ``word'' is a contiguous sequence of non-whitespace characters, separated
by spaces and/or tabs. Happily, this is the normal way @code{awk} separates
fields in its input data.

@item -c
Only count characters.
@end table

Implementing @code{wc} in @code{awk} is particularly elegant, since
@code{awk} does a lot of the work for us; it splits lines into words (i.e.@:
fields) and counts them, it counts lines (i.e.@: records) for us, and it can
easily tell us how long a line is.

This version uses the @code{getopt} library function
(@pxref{Getopt Function, ,Processing Command Line Options}),
and the file transition functions
(@pxref{Filetrans Function, ,Noting Data File Boundaries}).

This version has one major difference from traditional versions of @code{wc}.
Our version always prints the counts in the order lines, words,
and characters. Traditional versions note the order of the @samp{-l},
@samp{-w}, and @samp{-c} options on the command line, and print the counts
in that order.

The @code{BEGIN} rule does the argument processing.
The variable @code{print_total} will
be true if more than one file was named on the command line.

@findex wc.awk
@example
@c @group
@c file eg/prog/wc.awk
# wc.awk --- count lines, words, characters
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993

# Options:
#    -l    only count lines
#    -w    only count words
#    -c    only count characters
#
# Default is to count lines, words, characters

BEGIN @{
    # let getopt print a message about
    # invalid options. we ignore them
    while ((c = getopt(ARGC, ARGV, "lwc")) != -1) @{
        if (c == "l")
            do_lines = 1
        else if (c == "w")
            do_words = 1
        else if (c == "c")
            do_chars = 1
    @}
    for (i = 1; i < Optind; i++)
        ARGV[i] = ""

    # if no options, do all
    if (! do_lines && ! do_words && ! \
            do_chars)
        do_lines = do_words = do_chars = 1

    # i is now the index of the first file name, so
    # ARGC - i is the number of files named
    print_total = (ARGC - i > 1)
@}
@c endfile
@c @end group
@end example

The @code{beginfile} function is simple; it just resets the counts of lines,
words, and characters to zero, and saves the current file name in
@code{fname}.

The @code{endfile} function adds the current file's numbers to the running
totals of lines, words, and characters. It then prints out those numbers
for the file that was just read. It relies on @code{beginfile} to reset the
numbers for the following data file.

@example
@c left brace on line with `function' because of page breaking
@c file eg/prog/wc.awk
@group
function beginfile(file) @{
    chars = lines = words = 0
    fname = FILENAME
@}
@end group

function endfile(file)
@{
    tchars += chars
    tlines += lines
    twords += words
    if (do_lines)
        printf "\t%d", lines
    if (do_words)
        printf "\t%d", words
    if (do_chars)
        printf "\t%d", chars
    printf "\t%s\n", fname
@}
@c endfile
@end example

There is one rule that is executed for each line. It adds the length of the
record to @code{chars}. It has to add one, since the newline character
separating records (the value of @code{RS}) is not part of the record
itself. @code{lines} is incremented for each line read, and @code{words} is
incremented by the value of @code{NF}, the number of ``words'' on this
line.@footnote{Examine the code in
@ref{Filetrans Function, ,Noting Data File Boundaries}.
Why must @code{wc} use a separate @code{lines} variable, instead of using
the value of @code{FNR} in @code{endfile}?}

Finally, the @code{END} rule simply prints the totals for all the files.

@example
@c @group
@c file eg/prog/wc.awk
# do per line
@{
    chars += length($0) + 1    # get newline
    lines++
    words += NF
@}

END @{
    if (print_total) @{
        if (do_lines)
            printf "\t%d", tlines
        if (do_words)
            printf "\t%d", twords
        if (do_chars)
            printf "\t%d", tchars
        print "\ttotal"
    @}
@}
@c endfile
@c @end group
@end example

@node Miscellaneous Programs, , Clones, Sample Programs
@section A Grab Bag of @code{awk} Programs

This section is a large ``grab bag'' of miscellaneous programs.
We hope you find them both interesting and enjoyable.

@menu
* Dupword Program::             Finding duplicated words in a document.
* Alarm Program::               An alarm clock.
* Translate Program::           A program similar to the @code{tr} utility.
* Labels Program::              Printing mailing labels.
* Word Sorting::                A program to produce a word usage count.
* History Sorting::             Eliminating duplicate entries from a history
                                file.
* Extract Program::             Pulling out programs from Texinfo source
                                files.
* Simple Sed::                  A Simple Stream Editor.
* Igawk Program::               A wrapper for @code{awk} that includes files.
@end menu

@node Dupword Program, Alarm Program, Miscellaneous Programs, Miscellaneous Programs
@subsection Finding Duplicated Words in a Document

A common error when writing large amounts of prose is to accidentally
duplicate words. Often you will see this in text as something like ``the
the program does the following @dots{}.'' When the text is on-line, often
the duplicated words occur at the end of one line and the beginning of
another, making them very difficult to spot.
@c as here!

This program, @file{dupword.awk}, scans through a file one line at a time,
and looks for adjacent occurrences of the same word.
It also saves the last
word on a line (in the variable @code{prev}) for comparison with the first
word on the next line.

The first statement makes sure that the line is all lower-case, so that,
for example,
``The'' and ``the'' compare equal to each other. The second statement
removes all non-alphanumeric and non-whitespace characters from the line, so
that punctuation does not affect the comparison either. This sometimes
leads to reports of duplicated words that really are different, but this is
unusual. Lines that are empty after this cleanup are skipped, so that they
are not reported as duplicates of an empty @code{prev}.

@findex dupword.awk
@example
@group
@c file eg/prog/dupword.awk
# dupword --- find duplicate words in text
# Arnold Robbins, arnold@@gnu.org, Public Domain
# December 1991

@{
    $0 = tolower($0)
    gsub(/[^A-Za-z0-9 \t]/, "");
    if (NF == 0)    # nothing left to compare on this line
        next
    if ($1 == prev)
        printf("%s:%d: duplicate %s\n",
            FILENAME, FNR, $1)
    for (i = 2; i <= NF; i++)
        if ($i == $(i-1))
            printf("%s:%d: duplicate %s\n",
                FILENAME, FNR, $i)
    prev = $NF
@}
@c endfile
@end group
@end example

@node Alarm Program, Translate Program, Dupword Program, Miscellaneous Programs
@subsection An Alarm Clock Program

The following program is a simple ``alarm clock'' program.
You give it a time of day, and an optional message. At the given time,
it prints the message on the standard output. In addition, you can give it
the number of times to repeat the message, and also a delay between
repetitions.

This program uses the @code{gettimeofday} function from
@ref{Gettimeofday Function, ,Managing the Time of Day}.

All the work is done in the @code{BEGIN} rule. The first part is argument
checking and setting of defaults: the delay, the count, and the message to
print.
If the user supplied a message, but it does not contain the ASCII BEL
character (known as the ``alert'' character, @samp{\a}), then it is added to
the message. (On many systems, printing the ASCII BEL generates some sort
of audible alert. Thus, when the alarm goes off, the system calls attention
to itself, in case the user is not looking at their computer or terminal.)

@findex alarm.awk
@example
@c @group
@c file eg/prog/alarm.awk
# alarm --- set an alarm
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993

# usage: alarm time [ "message" [ count [ delay ] ] ]

BEGIN    \
@{
    # Initial argument sanity checking
    usage1 = "usage: alarm time ['message' [count [delay]]]"
    usage2 = sprintf("\t(%s) time ::= hh:mm", ARGV[1])

    if (ARGC < 2) @{
        print usage1 > "/dev/stderr"
        exit 1
    @} else if (ARGC == 5) @{
        delay = ARGV[4] + 0
        count = ARGV[3] + 0
        message = ARGV[2]
    @} else if (ARGC == 4) @{
        count = ARGV[3] + 0
        message = ARGV[2]
    @} else if (ARGC == 3) @{
        message = ARGV[2]
    @} else if (ARGV[1] !~ /[0-9]?[0-9]:[0-9][0-9]/) @{
        print usage1 > "/dev/stderr"
        print usage2 > "/dev/stderr"
        exit 1
    @}

    # set defaults for once we reach the desired time
    if (delay == 0)
        delay = 180    # 3 minutes
    if (count == 0)
        count = 5
@group
    if (message == "")
        message = sprintf("\aIt is now %s!\a", ARGV[1])
    else if (index(message, "\a") == 0)
        message = "\a" message "\a"
@end group
@c endfile
@end example

The next section of code turns the alarm time into hours and minutes,
and converts it if necessary to a 24-hour clock. Then it turns that
time into a count of the seconds since midnight. Next it turns the current
time into a count of seconds since midnight.
The difference between the two
is how long to wait before setting off the alarm.

@example
@c @group
@c file eg/prog/alarm.awk
    # split up dest time
    split(ARGV[1], atime, ":")
    hour = atime[1] + 0    # force numeric
    minute = atime[2] + 0  # force numeric

    # get current broken down time
    gettimeofday(now)

    # if time given is 12-hour hours and it's after that
    # hour, e.g., `alarm 5:30' at 9 a.m. means 5:30 p.m.,
    # then add 12 to real hour
    if (hour < 12 && now["hour"] > hour)
        hour += 12

    # set target time in seconds since midnight
    target = (hour * 60 * 60) + (minute * 60)

    # get current time in seconds since midnight
    current = (now["hour"] * 60 * 60) + \
               (now["minute"] * 60) + now["second"]

    # how long to sleep for
    naptime = target - current
    if (naptime <= 0) @{
        print "time is in the past!" > "/dev/stderr"
        exit 1
    @}
@c endfile
@c @end group
@end example

Finally, the program uses the @code{system} function
(@pxref{I/O Functions, ,Built-in Functions for Input/Output})
to call the @code{sleep} utility. The @code{sleep} utility simply pauses
for the given number of seconds. If the exit status is not zero,
the program assumes that @code{sleep} was interrupted, and exits. If
@code{sleep} exited with an OK status (zero), then the program prints the
message in a loop, again using @code{sleep} to delay for however many
seconds are necessary.

@example
@c file eg/prog/alarm.awk
@group
    # zzzzzz..... go away if interrupted
    if (system(sprintf("sleep %d", naptime)) != 0)
        exit 1
@end group

    # time to notify!
    command = sprintf("sleep %d", delay)
    for (i = 1; i <= count; i++) @{
        print message
        # if sleep command interrupted, go away
        if (system(command) != 0)
            break
    @}

    exit 0
@}
@c endfile
@end example

@node Translate Program, Labels Program, Alarm Program, Miscellaneous Programs
@subsection Transliterating Characters

The system @code{tr} utility transliterates characters. For example, it is
often used to map upper-case letters into lower-case, for further
processing.

@example
@var{generate data} | tr '[A-Z]' '[a-z]' | @var{process data} @dots{}
@end example

You give @code{tr} two lists of characters enclosed in square brackets.
Usually, the lists are quoted to keep the shell from attempting to do a
filename expansion.@footnote{On older, non-POSIX systems, @code{tr} often
does not require that the lists be enclosed in square brackets and quoted.
This is a feature.} When processing the input, the
first character in the first list is replaced with the first character in the
second list, the second character in the first list is replaced with the
second character in the second list, and so on.
If there are more characters in the ``from'' list than in the ``to'' list,
the last character of the ``to'' list is used for the remaining characters
in the ``from'' list.

Some time ago,
@c early or mid-1989!
a user proposed to us that we add a transliteration function to @code{gawk}.
Being opposed to ``creeping featurism,'' I wrote the following program to
prove that character transliteration could be done with a user-level
function. This program is not as complete as the system @code{tr} utility,
but it will do most of the job.

The @code{translate} program demonstrates one of the few weaknesses of
standard
@code{awk}: dealing with individual characters is very painful, requiring
repeated use of the @code{substr}, @code{index}, and @code{gsub} built-in
functions
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).@footnote{This
program was written before @code{gawk} acquired the ability to
split each character in a string into separate array elements.
How might you use this new feature to simplify the program?}

There are two functions. The first, @code{stranslate}, takes three
arguments.

@table @code
@item from
A list of characters to translate from.

@item to
A list of characters to translate to.

@item target
The string to do the translation on.
@end table

Associative arrays make the translation part fairly easy. @code{t_ar} holds
the ``to'' characters, indexed by the ``from'' characters. Then a simple
loop goes through @code{from}, one character at a time. For each character
in @code{from}, if the character appears in @code{target}, @code{gsub}
is used to change it to the corresponding @code{to} character.

The @code{translate} function simply calls @code{stranslate} using @code{$0}
as the target. The main program sets two global variables, @code{FROM} and
@code{TO}, from the command line, and then changes @code{ARGV} so that
@code{awk} will read from the standard input.

Finally, the processing rule simply calls @code{translate} for each record.

@findex translate.awk
@example
@c @group
@c file eg/prog/translate.awk
# translate --- do tr like stuff
# Arnold Robbins, arnold@@gnu.org, Public Domain
# August 1989

# bugs: does not handle things like: tr A-Z a-z, it has
# to be spelled out.
# However, if `to' is shorter than `from',
# the last character in `to' is used for the rest of `from'.

function stranslate(from, to, target,     lf, lt, t_ar, i, c)
@{
    lf = length(from)
    lt = length(to)
    for (i = 1; i <= lt; i++)
        t_ar[substr(from, i, 1)] = substr(to, i, 1)
    if (lt < lf)
        for (; i <= lf; i++)
            t_ar[substr(from, i, 1)] = substr(to, lt, 1)
    for (i = 1; i <= lf; i++) @{
        c = substr(from, i, 1)
        if (index(target, c) > 0)
            gsub(c, t_ar[c], target)
    @}
    return target
@}

function translate(from, to)
@{
    return $0 = stranslate(from, to, $0)
@}

@group
# main program
BEGIN @{
    if (ARGC < 3) @{
        print "usage: translate from to" > "/dev/stderr"
        exit
    @}
@end group
    FROM = ARGV[1]
    TO = ARGV[2]
    ARGC = 2
    ARGV[1] = "-"
@}

@{
    translate(FROM, TO)
    print
@}
@c endfile
@c @end group
@end example

While it is possible to do character transliteration in a user-level
function, it is not necessarily efficient, and we started to consider adding
a built-in function. However, shortly after writing this program, we learned
that the System V Release 4 @code{awk} had added the @code{toupper} and
@code{tolower} functions. These functions handle the vast majority of the
cases where character transliteration is necessary, and so we chose to
simply add those functions to @code{gawk} as well, and then leave well
enough alone.

An obvious improvement to this program would be to set up the
@code{t_ar} array only once, in a @code{BEGIN} rule. However, this
assumes that the ``from'' and ``to'' lists
will never change throughout the lifetime of the program.
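
One way to set up the table only once might look like the following.
This is only a sketch, not part of the distributed @file{translate.awk}:
the file name @file{translate2.awk} and the variables @code{out} and
@code{n} are our own, and it assumes that the two lists are fixed for the
whole run. It also maps each input character directly through
@code{t_ar} with @code{substr}, instead of calling @code{gsub} once per
``from'' character.

@example
# translate2.awk --- hypothetical variant of translate.awk that
# builds the translation table once (illustrative sketch only)

BEGIN @{
    if (ARGC < 3) @{
        print "usage: translate2 from to" > "/dev/stderr"
        exit 1
    @}
    from = ARGV[1]
    to = ARGV[2]
    lf = length(from)
    lt = length(to)
    # build the table once; if `to' is shorter than `from',
    # its last character is reused for the rest of `from'
    for (i = 1; i <= lf; i++)
        t_ar[substr(from, i, 1)] = substr(to, (i <= lt ? i : lt), 1)
    ARGC = 2
    ARGV[1] = "-"
@}

@{
    out = ""
    n = length($0)
    # map each character of the record at most once
    for (i = 1; i <= n; i++) @{
        c = substr($0, i, 1)
        if (c in t_ar)
            c = t_ar[c]
        out = out c
    @}
    print out
@}
@end example

@noindent
Note that this is not an exact replacement for @code{stranslate}. Because
every character is looked up just once, a character that has already been
translated is never translated again, and characters that are special in
regular expressions need no extra care.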

@node Labels Program, Word Sorting, Translate Program, Miscellaneous Programs
@subsection Printing Mailing Labels

Here is a ``real world''@footnote{``Real world'' is defined as
``a program actually used to get something done.''}
program. This script reads lists of names and
addresses, and generates mailing labels. Each page of labels has 20 labels
on it, two across and ten down. The addresses are guaranteed to be no more
than five lines of data. Each address is separated from the next by a blank
line.

The basic idea is to read 20 labels worth of data. Each line of each label
is stored in the @code{line} array. The single rule takes care of filling
the @code{line} array and printing the page when 20 labels have been read.

The @code{BEGIN} rule simply sets @code{RS} to the empty string, so that
@code{awk} will split records at blank lines
(@pxref{Records, ,How Input is Split into Records}).
It sets @code{MAXLINES} to 100, since @code{MAXLINES} is the maximum number
of lines on the page (20 * 5 = 100).

Most of the work is done in the @code{printpage} function.
The label lines are stored sequentially in the @code{line} array. But they
have to be printed horizontally; @code{line[1]} next to @code{line[6]},
@code{line[2]} next to @code{line[7]}, and so on. Two loops are used to
accomplish this. The outer loop, controlled by @code{i}, steps through
every 10 lines of data; this is each row of labels. The inner loop,
controlled by @code{j}, goes through the lines within the row.
As @code{j} goes from zero to four, @samp{i+j} is the @code{j}'th line in
the row, and @samp{i+j+5} is the entry next to it.
The output ends up
looking something like this:

@example
line 1          line 6
line 2          line 7
line 3          line 8
line 4          line 9
line 5          line 10
@end example

As a final note, at lines 21 and 61, an extra blank line is printed, to keep
the output lined up on the labels. This is dependent on the particular
brand of labels in use when the program was written. You will also note
that there are two blank lines at the top and two blank lines at the bottom.

The @code{END} rule arranges to flush the final page of labels; there may
not have been an even multiple of 20 labels in the data.

@findex labels.awk
@example
@c @group
@c file eg/prog/labels.awk
# labels.awk
# Arnold Robbins, arnold@@gnu.org, Public Domain
# June 1992

# Program to print labels. Each label is 5 lines of data
# that may have blank lines. The label sheets have 2
# blank lines at the top and 2 at the bottom.

BEGIN    @{ RS = "" ; MAXLINES = 100 @}

function printpage(    i, j)
@{
    if (Nlines <= 0)
        return

    printf "\n\n"        # header

    for (i = 1; i <= Nlines; i += 10) @{
        if (i == 21 || i == 61)
            print ""
        for (j = 0; j < 5; j++) @{
            if (i + j > MAXLINES)
                break
            printf " %-41s %s\n", line[i+j], line[i+j+5]
        @}
        print ""
    @}

    printf "\n\n"        # footer

    for (i in line)
        line[i] = ""
@}

# main rule
@{
    if (Count >= 20) @{
        printpage()
        Count = 0
        Nlines = 0
    @}
    n = split($0, a, "\n")
    for (i = 1; i <= n; i++)
        line[++Nlines] = a[i]
    for (; i <= 5; i++)
        line[++Nlines] = ""
    Count++
@}

END    \
@{
    printpage()
@}
@c endfile
@c @end group
@end example

@node Word Sorting, History Sorting, Labels Program, Miscellaneous Programs
@subsection Generating Word Usage Counts

The following @code{awk} program prints
the number of occurrences of each word in its input. It illustrates the
associative nature of @code{awk} arrays by using strings as subscripts. It
also demonstrates the @samp{for @var{x} in @var{array}} construction.
Finally, it shows how @code{awk} can be used in conjunction with other
utility programs to do a useful task of some complexity with a minimum of
effort. Some explanations follow the program listing.

@example
awk '
# Print list of word frequencies
@{
    for (i = 1; i <= NF; i++)
        freq[$i]++
@}

@group
END @{
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
@}'
@end group
@end example

The first thing to notice about this program is that it has two rules. The
first rule, because it has an empty pattern, is executed on every line of
the input.
It uses @code{awk}'s field-accessing mechanism
(@pxref{Fields, ,Examining Fields}) to pick out the individual words from
the line, and the built-in variable @code{NF} (@pxref{Built-in Variables})
to know how many fields are available.

For each input word, an element of the array @code{freq} is incremented to
reflect that the word has been seen an additional time.

The second rule, because it has the pattern @code{END}, is not executed
until the input has been exhausted. It prints out the contents of the
@code{freq} table that has been built up inside the first action.

This program has several problems that would prevent it from being
useful by itself on real text files:

@itemize @bullet
@item
Words are detected using the @code{awk} convention that fields are
separated by whitespace and that other characters in the input (except
newlines) don't have any special meaning to @code{awk}. This means that
punctuation characters count as part of words.

@item
The @code{awk} language considers upper- and lower-case characters to be
distinct. Therefore, @samp{bartender} and @samp{Bartender} are not treated
as the same word. This is undesirable since, in normal text, words
are capitalized if they begin sentences, and a frequency analyzer should not
be sensitive to capitalization.

@item
The output does not come out in any useful order. You're more likely to be
interested in which words occur most frequently, or in having an alphabetized
table of how frequently each word occurs.
@end itemize

The way to solve these problems is to use some of the more advanced
features of the @code{awk} language. First, we use @code{tolower} to remove
case distinctions. Next, we use @code{gsub} to remove punctuation
characters.
Finally, we use the system @code{sort} utility to process the
output of the @code{awk} script. Here is the new version of
the program:

@findex wordfreq.awk
@example
@c file eg/prog/wordfreq.awk
# Print list of word frequencies
@{
    $0 = tolower($0) # remove case distinctions
    gsub(/[^a-z0-9_ \t]/, "", $0) # remove punctuation
    for (i = 1; i <= NF; i++)
        freq[$i]++
@}
@c endfile

@group
END @{
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
@}
@end group
@end example

Assuming we have saved this program in a file named @file{wordfreq.awk},
and that the data is in @file{file1}, the following pipeline

@example
awk -f wordfreq.awk file1 | sort +1 -nr
@end example

@noindent
produces a table of the words appearing in @file{file1} in order of
decreasing frequency.

The @code{awk} program suitably massages the data and produces a word
frequency table, which is not ordered.

The @code{awk} script's output is then sorted by the @code{sort} utility and
printed on the terminal. The options given to @code{sort} in this example
tell it to sort on the second field of each input line (skipping one field),
to treat the sort keys as numeric quantities (otherwise
@samp{15} would come before @samp{5}), and to sort
in descending (reverse) order.

We could have even done the @code{sort} from within the program, by
changing the @code{END} action to:

@example
@c file eg/prog/wordfreq.awk
END @{
    sort = "sort +1 -nr"
    for (word in freq)
        printf "%s\t%d\n", word, freq[word] | sort
    close(sort)
@}
@c endfile
@end example

You would have to use this way of sorting on systems that do not
have true pipes.
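
The @samp{+1} way of naming a sort key is an obsolescent form that current
POSIX-conforming versions of @code{sort} may reject; the @samp{-k} option
is its replacement. What follows is a minimal sketch, not part of the
@code{gawk} distribution, that wraps the same @code{awk} program in a shell
function and sorts with @samp{-k}. The function name @code{wordfreq} is made
up for this illustration, and a @code{sort} that accepts @samp{-k} is assumed.

```shell
# wordfreq: hypothetical wrapper name, not part of the gawk distribution.
# Reads text on standard input and prints "word<TAB>count" lines,
# most frequent word first.
wordfreq() {
    awk '
    {
        $0 = tolower($0)                 # remove case distinctions
        gsub(/[^a-z0-9_ \t]/, "", $0)    # remove punctuation
        for (i = 1; i <= NF; i++)
            freq[$i]++
    }
    END {
        for (word in freq)
            printf "%s\t%d\n", word, freq[word]
    }' |
    # -k 2nr: second field, numeric, reversed (replaces "+1 -nr")
    # -k 1,1: assumption of this sketch, break ties by the word itself
    sort -k 2nr -k 1,1
}
```

With this definition, @samp{wordfreq < file1} should produce the same kind of
table as the pipeline shown earlier. The order of words with equal counts
depends on the current locale.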

See the general operating system documentation for more information on how
to use the @code{sort} program.

@node History Sorting, Extract Program, Word Sorting, Miscellaneous Programs
@subsection Removing Duplicates from Unsorted Text

The @code{uniq} program
(@pxref{Uniq Program, ,Printing Non-duplicated Lines of Text})
removes duplicate lines from @emph{sorted} data.

Suppose, however, that you need to remove duplicate lines from a data file
but wish to preserve the order the lines are in. A good example of
this might be a shell history file. The history file keeps a copy of all
the commands you have entered, and it is not unusual to repeat a command
several times in a row. Occasionally you might wish to compact the history
by removing duplicate entries. Yet it is desirable to maintain the order
of the original commands.

This simple program does the job. It uses two arrays. The @code{data}
array is indexed by the text of each line.
For each line, @code{data[$0]} is incremented.

If a particular line has not
been seen before, then @code{data[$0]} will be zero.
In that case, the text of the line is stored in @code{lines[count]}.
Each element of @code{lines} is a unique command, and the indices of
@code{lines} indicate the order in which those lines were encountered.
The @code{END} rule simply prints out the lines, in order.

@cindex Rakitzis, Byron
@findex histsort.awk
@example
@group
@c file eg/prog/histsort.awk
# histsort.awk --- compact a shell history file
# Arnold Robbins, arnold@@gnu.org, Public Domain
# May 1993

# Thanks to Byron Rakitzis for the general idea
@{
    if (data[$0]++ == 0)
        lines[++count] = $0
@}

END @{
    for (i = 1; i <= count; i++)
        print lines[i]
@}
@c endfile
@end group
@end example

This program also provides a foundation for generating other useful
information. For example, using the following @code{print} statement in the
@code{END} rule would indicate how often a particular command was used.

@example
print data[lines[i]], lines[i]
@end example

This works because @code{data[$0]} was incremented each time a line was
seen.

@node Extract Program, Simple Sed, History Sorting, Miscellaneous Programs
@subsection Extracting Programs from Texinfo Source Files

@iftex
Both this chapter and the previous chapter
(@ref{Library Functions, ,A Library of @code{awk} Functions})
present a large number of @code{awk} programs.
@end iftex
@ifinfo
The nodes
@ref{Library Functions, ,A Library of @code{awk} Functions},
and @ref{Sample Programs, ,Practical @code{awk} Programs},
are the top level nodes for a large number of @code{awk} programs.
@end ifinfo
If you wish to experiment with these programs, it is tedious to have to type
them in by hand. Here we present a program that can extract parts of a
Texinfo input file into separate files.

This @value{DOCUMENT} is written in Texinfo, the GNU project's document
formatting language. A single Texinfo source file can be used to produce both
printed and on-line documentation.
@iftex
Texinfo is fully documented in @cite{Texinfo---The GNU Documentation Format},
available from the Free Software Foundation.
@end iftex
@ifinfo
The Texinfo language is described fully, starting with
@ref{Top, , Introduction, texi, Texinfo---The GNU Documentation Format}.
@end ifinfo

For our purposes, it is enough to know three things about Texinfo input
files.

@itemize @bullet
@item
The ``at'' symbol, @samp{@@}, is special in Texinfo, much like @samp{\} in C
or @code{awk}. Literal @samp{@@} symbols are represented in Texinfo source
files as @samp{@@@@}.

@item
Comments start with either @samp{@@c} or @samp{@@comment}.
The file extraction program will work by using special comments that start
at the beginning of a line.

@item
Example text that should not be split across a page boundary is bracketed
between lines containing @samp{@@group} and @samp{@@end group} commands.
@end itemize

The following program, @file{extract.awk}, reads through a Texinfo source
file, and does two things, based on the special comments.
Upon seeing @samp{@w{@@c system @dots{}}},
it runs a command, by extracting the command text from the
control line and passing it on to the @code{system} function
(@pxref{I/O Functions, ,Built-in Functions for Input/Output}).
Upon seeing @samp{@@c file @var{filename}}, each subsequent line is sent to
the file @var{filename}, until @samp{@@c endfile} is encountered.
The rules in @file{extract.awk} will match either @samp{@@c} or
@samp{@@comment} by letting the @samp{omment} part be optional.
Lines containing @samp{@@group} and @samp{@@end group} are simply removed.
@file{extract.awk} uses the @code{join} library function
(@pxref{Join Function, ,Merging an Array Into a String}).
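
The copying part of this description can be tried on its own. The following
is a simplified sketch rather than @file{extract.awk} itself. It is a shell
function, hypothetically named @code{texi_extract}, that prints the text of
one named file on standard output instead of creating files, ignores
@samp{system} lines, and undoes the @samp{@@} escapes with three @code{gsub}
calls instead of @code{split} and @code{join}. The control character used as
a placeholder is assumed not to occur in the input.

```shell
# texi_extract: hypothetical helper, a simplified stand-in for extract.awk.
# usage: texi_extract NAME < input.texi
# Prints the lines found between "@c file NAME" and "@c endfile".
# Unlike extract.awk, the directives must be in lower case, because
# this sketch sticks to POSIX awk and does not use IGNORECASE.
texi_extract() {
    awk -v want="$1" '
    /^@c(omment)?[ \t]+file[ \t]/ { copying = ($3 == want); next }
    /^@c(omment)?[ \t]+endfile/   { copying = 0; next }
    /^@(end[ \t]+)?group/         { next }   # layout-only lines
    copying {
        gsub(/@@/, "\001")    # protect doubled at-signs
        gsub(/@/, "")         # drop at-signs that escape braces
        gsub("\001", "@")     # restore one literal at-sign
        print
    }'
}
```

Printing to standard output keeps the sketch free of side effects; the real
program writes to the named files and also handles @samp{system} lines.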

The example programs in the on-line Texinfo source for @cite{@value{TITLE}}
(@file{gawk.texi}) have all been bracketed inside @samp{file}
and @samp{endfile} lines. The @code{gawk} distribution uses a copy of
@file{extract.awk} to extract the sample
programs and install many of them in a standard directory, where
@code{gawk} can find them.
The Texinfo file looks something like this:

@example
@dots{}
This program has a @@code@{BEGIN@} block,
which prints a nice message:

@@example
@@c file examples/messages.awk
BEGIN @@@{ print "Don't panic!" @@@}
@@c endfile
@@end example

It also prints some final advice:

@@example
@@c file examples/messages.awk
END @@@{ print "Always avoid bored archeologists!" @@@}
@@c endfile
@@end example
@dots{}
@end example

@file{extract.awk} begins by setting @code{IGNORECASE} to one, so that
mixed upper-case and lower-case letters in the directives won't matter.

The first rule handles calling @code{system}, checking that a command was
given (@code{NF} is at least three), and also checking that the command
exited with a zero exit status, signifying OK.

@findex extract.awk
@example
@c @group
@c file eg/prog/extract.awk
# extract.awk --- extract files and run programs
#                 from texinfo files
# Arnold Robbins, arnold@@gnu.org, Public Domain, May 1993

BEGIN @{ IGNORECASE = 1 @}

@group
/^@@c(omment)?[ \t]+system/ \
@{
    if (NF < 3) @{
        e = (FILENAME ":" FNR)
        e = (e ": badly formed `system' line")
        print e > "/dev/stderr"
        next
    @}
    $1 = ""
    $2 = ""
    stat = system($0)
    if (stat != 0) @{
        e = (FILENAME ":" FNR)
        e = (e ": warning: system returned " stat)
        print e > "/dev/stderr"
    @}
@}
@end group
@c endfile
@end example

@noindent
The variable @code{e} is used so that the rule
fits nicely on the
@iftex
page.
@end iftex
@ifinfo
screen.
@end ifinfo

The second rule handles moving data into files. It verifies that a file
name was given in the directive. If the file named is not the current file,
then the current file is closed. This means that an @samp{@@c endfile} was
not given for that file. (We should probably print a diagnostic in this
case, although at the moment we do not.)

The @samp{for} loop does the work. It reads lines using @code{getline}
(@pxref{Getline, ,Explicit Input with @code{getline}}).
For an unexpected end of file, it calls the @code{@w{unexpected_eof}}
function. If the line is an ``endfile'' line, then it breaks out of
the loop.
If the line is an @samp{@@group} or @samp{@@end group} line, then it
ignores it, and goes on to the next line.
(These Texinfo control lines keep blocks of code together on one page;
unfortunately, @TeX{} isn't always smart enough to do things exactly right,
and we have to give it some advice.)

Most of the work is in the following few lines.
If the line has no @samp{@@}
symbols, it can be printed directly. Otherwise, each leading @samp{@@} must be
stripped off.

To remove the @samp{@@} symbols, the line is split into separate elements of
the array @code{a}, using the @code{split} function
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).
Each element of @code{a} that is empty indicates two successive @samp{@@}
symbols in the original line. For each two empty elements (@samp{@@@@} in
the original file), we have to add back in a single @samp{@@} symbol.

When the processing of the array is finished, @code{join} is called with the
value of @code{SUBSEP}, to rejoin the pieces back into a single
line. That line is then printed to the output file.

@example
@c @group
@c file eg/prog/extract.awk
@group
/^@@c(omment)?[ \t]+file/ \
@{
    if (NF != 3) @{
        e = (FILENAME ":" FNR ": badly formed `file' line")
        print e > "/dev/stderr"
        next
    @}
@end group
    if ($3 != curfile) @{
        if (curfile != "")
            close(curfile)
        curfile = $3
    @}

    for (;;) @{
        if ((getline line) <= 0)
            unexpected_eof()
        if (line ~ /^@@c(omment)?[ \t]+endfile/)
            break
        else if (line ~ /^@@(end[ \t]+)?group/)
            continue
        if (index(line, "@@") == 0) @{
            print line > curfile
            continue
        @}
        n = split(line, a, "@@")
@group
        # if a[1] == "", means leading @@,
        # don't add one back in.
@end group
        for (i = 2; i <= n; i++) @{
            if (a[i] == "") @{ # was an @@@@
                a[i] = "@@"
                if (a[i+1] == "")
                    i++
            @}
        @}
        print join(a, 1, n, SUBSEP) > curfile
    @}
@}
@c endfile
@c @end group
@end example

An important thing to note is the use of the @samp{>} redirection.
Output done with @samp{>} only opens the file once; it stays open and
subsequent output is appended to the file
(@pxref{Redirection, , Redirecting Output of @code{print} and @code{printf}}).
This allows us to easily mix program text and explanatory prose for the same
sample source file (as has been done here!) without any hassle. The file is
only closed when a new data file name is encountered, or at the end of the
input file.

Finally, the function @code{@w{unexpected_eof}} prints an appropriate
error message and then exits.

The @code{END} rule handles the final cleanup, closing the open file.

@example
@c file eg/prog/extract.awk
@group
function unexpected_eof()
@{
    printf("%s:%d: unexpected EOF or error\n", \
        FILENAME, FNR) > "/dev/stderr"
    exit 1
@}
@end group

END @{
    if (curfile)
        close(curfile)
@}
@c endfile
@end example

@node Simple Sed, Igawk Program, Extract Program, Miscellaneous Programs
@subsection A Simple Stream Editor

@cindex @code{sed} utility
The @code{sed} utility is a ``stream editor,'' a program that reads a
stream of data, makes changes to it, and passes the modified data on.
It is often used to make global changes to a large file, or to a stream
of data generated by a pipeline of commands.

While @code{sed} is a complicated program in its own right, its most common
use is to perform global substitutions in the middle of a pipeline:

@example
command1 < orig.data | sed 's/old/new/g' | command2 > result
@end example

Here, the @samp{s/old/new/g} tells @code{sed} to look for the regexp
@samp{old} on each input line, and replace it with the text @samp{new},
globally (i.e.@: all the occurrences on a line).
This is similar to
@code{awk}'s @code{gsub} function
(@pxref{String Functions, , Built-in Functions for String Manipulation}).

The following program, @file{awksed.awk}, accepts at least two command line
arguments: the pattern to look for and the text to replace it with. Any
additional arguments are treated as data file names to process. If none
are provided, the standard input is used.

@cindex Brennan, Michael
@cindex @code{awksed}
@cindex simple stream editor
@cindex stream editor, simple
@example
@c @group
@c file eg/prog/awksed.awk
# awksed.awk --- do s/foo/bar/g using just print
# Thanks to Michael Brennan for the idea

# Arnold Robbins, arnold@@gnu.org, Public Domain
# August 1995

function usage()
@{
    print "usage: awksed pat repl [files...]" > "/dev/stderr"
    exit 1
@}

@group
BEGIN @{
    # validate arguments
    if (ARGC < 3)
        usage()
@end group

    RS = ARGV[1]
    ORS = ARGV[2]

    # don't use arguments as files
    ARGV[1] = ARGV[2] = ""
@}

# look ma, no hands!
@{
    if (RT == "")
        printf "%s", $0
    else
        print
@}
@c endfile
@c @end group
@end example

The program relies on @code{gawk}'s ability to have @code{RS} be a regexp
and on the setting of @code{RT} to the actual text that terminated the
record (@pxref{Records, ,How Input is Split into Records}).

The idea is to have @code{RS} be the pattern to look for. @code{gawk}
will automatically set @code{$0} to the text between matches of the pattern.
This is text that we wish to keep, unmodified. Then, by setting @code{ORS}
to the replacement text, a simple @code{print} statement will output the
text we wish to keep, followed by the replacement text.

There is one wrinkle to this scheme, namely what to do if the last record
doesn't end with text that matches @code{RS}. Using a @code{print}
statement unconditionally prints the replacement text, which is not correct.

However, if the file did not end in text that matches @code{RS}, @code{RT}
will be set to the null string. In this case, we can print @code{$0} using
@code{printf}
(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).

The @code{BEGIN} rule handles the setup, checking for the right number
of arguments, and calling @code{usage} if there is a problem. Then it sets
@code{RS} and @code{ORS} from the command line arguments, and sets
@code{ARGV[1]} and @code{ARGV[2]} to the null string, so that they will
not be treated as file names
(@pxref{ARGC and ARGV, , Using @code{ARGC} and @code{ARGV}}).

The @code{usage} function prints an error message and exits.

Finally, the single rule handles the printing scheme outlined above,
using @code{print} or @code{printf} as appropriate, depending upon the
value of @code{RT}.

@ignore
Exercise, compare the performance of this version with the more
straightforward:

BEGIN {
    pat = ARGV[1]
    repl = ARGV[2]
    ARGV[1] = ARGV[2] = ""
}

{ gsub(pat, repl); print }

Exercise: what are the advantages and disadvantages of this version vs. sed?
  Advantage: egrep regexps
             speed (?)
  Disadvantage: no & in replacement text

Others?
@end ignore

@node Igawk Program, , Simple Sed, Miscellaneous Programs
@subsection An Easy Way to Use Library Functions

Using library functions in @code{awk} can be very beneficial. It
encourages code re-use and the writing of general functions. Programs are
smaller, and therefore clearer.
However, using library functions is only easy when writing @code{awk}
programs; it is painful when running them, requiring multiple @samp{-f}
options. If @code{gawk} is unavailable, then so too is the @code{AWKPATH}
environment variable and the ability to put @code{awk} functions into a
library directory (@pxref{Options, ,Command Line Options}).

It would be nice to be able to write programs like so:

@example
# library functions
@@include getopt.awk
@@include join.awk
@dots{}

# main program
BEGIN @{
    while ((c = getopt(ARGC, ARGV, "a:b:cde")) != -1)
        @dots{}
    @dots{}
@}
@end example

The following program, @file{igawk.sh}, provides this service.
It simulates @code{gawk}'s searching of the @code{AWKPATH} variable,
and also allows @dfn{nested} includes; i.e.@: a file that has been included
with @samp{@@include} can contain further @samp{@@include} statements.
@code{igawk} will make an effort to only include files once, so that nested
includes don't accidentally include a library function twice.

@code{igawk} should behave externally just like @code{gawk}. This means it
should accept all of @code{gawk}'s command line arguments, including the
ability to have multiple source files specified via @samp{-f}, and the
ability to mix command line and library source files.

The program is written using the POSIX Shell (@code{sh}) command language.
The way the program works is as follows:

@enumerate
@item
Loop through the arguments, saving anything that doesn't represent
@code{awk} source code for later, when the expanded program is run.

@item
For any arguments that do represent @code{awk} text, put the arguments into
a temporary file that will be expanded. There are two cases.

@enumerate a
@item
Literal text, provided with @samp{--source} or @samp{--source=}. This
text is just echoed directly. The @code{echo} program will automatically
supply a trailing newline.

@item
File names provided with @samp{-f}. We use a neat trick, and echo
@samp{@@include @var{filename}} into the temporary file. Since the file
inclusion program will work the way @code{gawk} does, this will get the text
of the file included into the program at the correct point.
@end enumerate

@item
Run an @code{awk} program (naturally) over the temporary file to expand
@samp{@@include} statements. The expanded program is placed in a second
temporary file.

@item
Run the expanded program with @code{gawk} and any other original command line
arguments that the user supplied (such as the data file names).
@end enumerate

The initial part of the program turns on shell tracing if the first
argument was @samp{debug}. Otherwise, a shell @code{trap} statement
arranges to clean up any temporary files on program exit or upon an
interrupt.

@c 2e: For the temporary file handling, use mktemp with $@{TMPDIR:-/tmp@}.

The next part loops through all the command line arguments.
There are several cases of interest.

@table @code
@item --
This ends the arguments to @code{igawk}. Anything else should be passed on
to the user's @code{awk} program without being evaluated.

@item -W
This indicates that the next option is specific to @code{gawk}. To make
argument processing easier, the @samp{-W} is prepended to the
remaining arguments and the loop continues. (This is an @code{sh}
programming trick. Don't worry about it if you are not familiar with
@code{sh}.)

@item -v
@itemx -F
These are saved and passed on to @code{gawk}.

@item -f
@itemx --file
@itemx --file=
@itemx -Wfile=
The file name is saved to a temporary file with an
@samp{@@include} statement.
The @code{sed} utility is used to remove the leading option part of the
argument (e.g., @samp{--file=}).

@item --source
@itemx --source=
@itemx -Wsource=
The source text is echoed into a temporary file.

@item --version
@itemx -Wversion
@code{igawk} prints its version number, and runs @samp{gawk --version}
to get the @code{gawk} version information, and then exits.
@end table

If none of @samp{-f}, @samp{--file}, @samp{-Wfile}, @samp{--source},
or @samp{-Wsource} were supplied, then the first non-option argument
should be the @code{awk} program. If there are no command line
arguments left, @code{igawk} prints an error message and exits.
Otherwise, the first argument is echoed into a temporary file.

In any case, after the arguments have been processed,
the complete text of the original @code{awk} program
is contained in a temporary file.

@cindex @code{sed} utility
Here's the program:

@findex igawk.sh
@example
@c @group
@c file eg/prog/igawk.sh
#! /bin/sh

# igawk --- like gawk but do @@include processing
# Arnold Robbins, arnold@@gnu.org, Public Domain
# July 1993

# Temporary file handling modifications for Owl by
# Jarno Huuskonen and Solar Designer, still Public Domain
# May 2001

if [ ! -x /bin/mktemp ]; then
    echo "$0 needs mktemp to create temporary files."
    exit 1
fi

STEMPFILE=`/bin/mktemp $@{TMPDIR:-/tmp@}/igawk.s.XXXXXX` || exit 1
ETEMPFILE=`/bin/mktemp $@{TMPDIR:-/tmp@}/igawk.e.XXXXXX` || exit 1

if [ "$1" = debug ]
then
    set -x
    shift
else
    # cleanup on exit, hangup, interrupt, quit, termination
    trap 'rm -f $STEMPFILE $ETEMPFILE' EXIT HUP INT QUIT TERM
fi

while [ $# -ne 0 ] # loop over arguments
do
    case $1 in
    --)     shift; break;;

    -W)     shift
            set -- -W"$@@"
            continue;;

    -[vF])  opts="$opts $1 '$2'"
            shift;;

    -[vF]*) opts="$opts '$1'" ;;

    -f)     echo @@include "$2" >> $STEMPFILE
            shift;;

@group
    -f*)    f=`echo "$1" | sed 's/-f//'`
            echo @@include "$f" >> $STEMPFILE ;;
@end group

    -?file=*) # -Wfile or --file
            f=`echo "$1" | sed 's/-.file=//'`
            echo @@include "$f" >> $STEMPFILE ;;

    -?file) # get arg, $2
            echo @@include "$2" >> $STEMPFILE
            shift;;

    -?source=*) # -Wsource or --source
            t=`echo "$1" | sed 's/-.source=//'`
            echo "$t" >> $STEMPFILE ;;

    -?source) # get arg, $2
            echo "$2" >> $STEMPFILE
            shift;;

    -?version)
            echo igawk: version 1.0 1>&2
            gawk --version
            exit 0 ;;

    -[W-]*) opts="$opts '$1'" ;;

    *)      break;;
    esac
    shift
done

if [ ! -s $STEMPFILE ]
then
    if [ -z "$1" ]
    then
        echo igawk: no program! 1>&2
        exit 1
    else
        echo "$1" > $STEMPFILE
        shift
    fi
fi

# at this point, $STEMPFILE has the program
@c endfile
@c @end group
@end example

The @code{awk} program to process @samp{@@include} directives reads through
the program, one line at a time using @code{getline}
(@pxref{Getline, ,Explicit Input with @code{getline}}).
The input file names and @samp{@@include} statements are managed using a
stack. As each @samp{@@include} is encountered, the current file name is
``pushed'' onto the stack, and the file named in the @samp{@@include}
directive becomes
the current file name. As each file is finished, the stack is ``popped,''
and the previous input file becomes the current input file again.
The process is started by making the original file the first one on the
stack.

The @code{pathto} function does the work of finding the full path to a
file. It simulates @code{gawk}'s behavior when searching the @code{AWKPATH}
environment variable
(@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}).
If a file name has a @samp{/} in it, no path search
is done. Otherwise, the file name is concatenated with the name of each
directory in the path, and an attempt is made to open the generated file
name. The only way in @code{awk} to test if a file can be read is to go
ahead and try to read it with @code{getline}; that is what @code{pathto}
does.@footnote{On some very old versions of @code{awk}, the test
@samp{getline junk < t} can loop forever if the file exists but is empty.
Caveat Emptor.}
If the file can be read, it is closed, and the file name is
returned.
@ignore
An alternative way to test for the file's existence would be to call
@samp{system("test -r " t)}, which uses the @code{test} utility to
see if the file exists and is readable. The disadvantage to this method
is that it requires creating an extra process, and can thus be slightly
slower.
@end ignore

@example
@c file eg/prog/igawk.sh
gawk -- '
# process @@include directives
@c endfile

@group
@c file eg/prog/igawk.sh
function pathto(file, i, t, junk)
@{
    if (index(file, "/") != 0)
        return file

    for (i = 1; i <= ndirs; i++) @{
        t = (pathlist[i] "/" file)
        if ((getline junk < t) > 0) @{
            # found it
            close(t)
            return t
        @}
    @}
    return ""
@}
@c endfile
@end group
@end example

The main program is contained inside one @code{BEGIN} rule. The first thing it
does is set up the @code{pathlist} array that @code{pathto} uses. After
splitting the path on @samp{:}, null elements are replaced with @code{"."},
which represents the current directory.

@example
@group
@c file eg/prog/igawk.sh
BEGIN @{
    path = ENVIRON["AWKPATH"]
    ndirs = split(path, pathlist, ":")
    for (i = 1; i <= ndirs; i++) @{
        if (pathlist[i] == "")
            pathlist[i] = "."
    @}
@c endfile
@end group
@end example

The stack is initialized with @code{ARGV[1]}, which will be @file{$STEMPFILE}.
The main loop comes next. Input lines are read in succession. Lines that
do not start with @samp{@@include} are printed verbatim.

If the line does start with @samp{@@include}, the file name is in @code{$2}.
@code{pathto} is called to generate the full path. If it cannot, then we
print an error message and continue.

The next thing to check is if the file has been included already. The
@code{processed} array is indexed by the full file name of each included
file, and it tracks this information for us. If the file has been
seen, a warning message is printed. Otherwise, the new file name is
pushed onto the stack and processing continues.
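
The stack discipline just described can be tried in isolation. The following
is a minimal sketch rather than the code of @file{igawk.sh}: a shell
function, hypothetically named @code{expand_includes}, that drives a small
POSIX @code{awk} program. As simplifying assumptions, file names are opened
exactly as written (no @code{AWKPATH} search), a repeated include is skipped
silently, and an unreadable file is treated as empty. When a file runs out,
the loop closes it and drops back to the file below it on the stack.

```shell
# expand_includes: hypothetical helper, not part of igawk.sh.
# usage: expand_includes FILE
# Prints FILE with each "@include NAME" line replaced by the contents
# of NAME. Assumptions of this sketch: NAME is opened as given (no
# AWKPATH search) and must not contain whitespace.
expand_includes() {
    awk '
    BEGIN {
        stackptr = 0
        input[stackptr] = ARGV[1]          # bottom of the stack
        for (; stackptr >= 0; stackptr--) {
            # the file being read is always the one on top of the stack
            while ((getline line < input[stackptr]) > 0) {
                n = split(line, f, " ")
                if (n < 2 || tolower(f[1]) != "@include") {
                    print line             # ordinary program text
                    continue
                }
                if (f[2] in processed)
                    continue               # include each file only once
                processed[f[2]] = 1
                input[++stackptr] = f[2]   # push the included file
            }
            close(input[stackptr])         # finished: the for loop pops it
        }
    }' "$1"
}
```

Because a file that is lower on the stack stays open while an included file
is read, reading resumes on the line after the @samp{@@include} once the
included file has been popped.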

Finally, when @code{getline} encounters the end of the input file, the file
is closed and the stack is popped. When @code{stackptr} is less than zero,
the program is done.

@example
@c @group
@c file eg/prog/igawk.sh
    stackptr = 0
    input[stackptr] = ARGV[1] # ARGV[1] is first file

    for (; stackptr >= 0; stackptr--) @{
        while ((getline < input[stackptr]) > 0) @{
            if (tolower($1) != "@@include") @{
                print
                continue
            @}
            fpath = pathto($2)
            if (fpath == "") @{
                printf("igawk:%s:%d: cannot find %s\n", \
                    input[stackptr], FNR, $2) > "/dev/stderr"
                continue
            @}
@group
            if (! (fpath in processed)) @{
                processed[fpath] = input[stackptr]
                input[++stackptr] = fpath
            @} else
                print $2, "included in", input[stackptr], \
                    "already included in", \
                    processed[fpath] > "/dev/stderr"
        @}
@end group
@group
        close(input[stackptr])
    @}
@}' $STEMPFILE > $ETEMPFILE
@end group
@c endfile
@c @end group
@end example

The last step is to call @code{gawk} with the expanded program and the original
options and command line arguments that the user supplied. @code{gawk}'s
exit status is passed back on to @code{igawk}'s calling program.

@c this causes more problems than it solves, so leave it out.
@ignore
The special file @file{/dev/null} is passed as a data file to @code{gawk}
to handle an interesting case. Suppose that the user's program only has
a @code{BEGIN} rule, and there are no data files to read.
The program should exit without reading any data
files. However, suppose that an included library file defines an @code{END}
rule of its own. In this case, @code{gawk} will hang, reading standard
input. In order to avoid this, @file{/dev/null} is explicitly added to the
command line.
Reading from @file{/dev/null} always returns an immediate
end of file indication.

@c Hmm. Add /dev/null if $# is 0? Still messes up ARGV. Sigh.
@end ignore

@example
@c @group
@c file eg/prog/igawk.sh
eval gawk -f $ETEMPFILE $opts -- "$@@"

exit $?
@c endfile
@c @end group
@end example

This version of @code{igawk} represents my third attempt at this program.
There are three key simplifications that made the program work better.

@enumerate
@item
Using @samp{@@include} even for the files named with @samp{-f} makes building
the initial collected @code{awk} program much simpler; all the
@samp{@@include} processing can be done once.

@item
The @code{pathto} function doesn't try to save the line read with
@code{getline} when testing for the file's accessibility. Trying to save
this line for use with the main program complicates things considerably.
@c what problem does this engender though - exercise
@c answer, reading from "-" or /dev/stdin

@item
Using a @code{getline} loop in the @code{BEGIN} rule does it all in one
place. It is not necessary to call out to a separate loop for processing
nested @samp{@@include} statements.
@end enumerate

Also, this program illustrates that it is often worthwhile to combine
@code{sh} and @code{awk} programming. You can usually accomplish
quite a lot, without having to resort to low-level programming in C or C++, and it
is frequently easier to do certain kinds of string and argument manipulation
using the shell than it is in @code{awk}.

Finally, @code{igawk} shows that it is not always necessary to add new
features to a program; they can often be layered on top. With @code{igawk},
there is no real reason to build @samp{@@include} processing into
@code{gawk} itself.

As an additional example of this, consider the idea of having two
files in a directory in the search path.

@table @file
@item default.awk
This file would contain a set of default library functions, such
as @code{getopt} and @code{assert}.

@item site.awk
This file would contain library functions that are specific to a site or
installation, i.e.@: locally developed functions.
Having a separate file allows @file{default.awk} to change with
new @code{gawk} releases, without requiring the system administrator to
update it each time by adding the local functions.
@end table

One user
@c Karl Berry, karl@ileaf.com, 10/95
suggested that @code{gawk} be modified to automatically read these files
upon startup. Instead, it would be very simple to modify @code{igawk}
to do this. Since @code{igawk} can process nested @samp{@@include}
directives, @file{default.awk} could simply contain @samp{@@include}
statements for the desired library functions.

@c Exercise: make this change

@node Language History, Gawk Summary, Sample Programs, Top
@chapter The Evolution of the @code{awk} Language

This @value{DOCUMENT} describes the GNU implementation of @code{awk}, which follows
the POSIX specification. Many @code{awk} users are only familiar
with the original @code{awk} implementation in Version 7 Unix.
(This implementation was the basis for @code{awk} in Berkeley Unix,
through 4.3--Reno. The 4.4 release of Berkeley Unix uses @code{gawk} 2.15.2
for its version of @code{awk}.) This chapter briefly describes the
evolution of the @code{awk} language, with cross references to other parts
of the @value{DOCUMENT} where you can find more information.

@menu
* V7/SVR3.1::                   The major changes between V7 and System V
                                Release 3.1.
* SVR4::                        Minor changes between System V Releases 3.1
                                and 4.
* POSIX::                       New features from the POSIX standard.
* BTL::                         New features from the Bell Laboratories
                                version of @code{awk}.
* POSIX/GNU::                   The extensions in @code{gawk} not in POSIX
                                @code{awk}.
@end menu

@node V7/SVR3.1, SVR4, Language History, Language History
@section Major Changes between V7 and SVR3.1

The @code{awk} language evolved considerably between the release of
Version 7 Unix (1978) and the new version first made generally available in
System V Release 3.1 (1987). This section summarizes the changes, with
cross-references to further details.

@itemize @bullet
@item
The requirement for @samp{;} to separate rules on a line
(@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}).

@item
User-defined functions, and the @code{return} statement
(@pxref{User-defined, ,User-defined Functions}).

@item
The @code{delete} statement (@pxref{Delete, ,The @code{delete} Statement}).

@item
The @code{do}-@code{while} statement
(@pxref{Do Statement, ,The @code{do}-@code{while} Statement}).

@item
The built-in functions @code{atan2}, @code{cos}, @code{sin}, @code{rand} and
@code{srand} (@pxref{Numeric Functions, ,Numeric Built-in Functions}).

@item
The built-in functions @code{gsub}, @code{sub}, and @code{match}
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).

@item
The built-in functions @code{close} and @code{system}
(@pxref{I/O Functions, ,Built-in Functions for Input/Output}).

@item
The @code{ARGC}, @code{ARGV}, @code{FNR}, @code{RLENGTH}, @code{RSTART},
and @code{SUBSEP} built-in variables (@pxref{Built-in Variables}).

@item
The conditional expression using the ternary operator @samp{?:}
(@pxref{Conditional Exp, ,Conditional Expressions}).

@item
The exponentiation operator @samp{^}
(@pxref{Arithmetic Ops, ,Arithmetic Operators}) and its assignment operator
form @samp{^=} (@pxref{Assignment Ops, ,Assignment Expressions}).

@item
C-compatible operator precedence, which breaks some old @code{awk}
programs (@pxref{Precedence, ,Operator Precedence (How Operators Nest)}).

@item
Regexps as the value of @code{FS}
(@pxref{Field Separators, ,Specifying How Fields are Separated}), and as the
third argument to the @code{split} function
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).

@item
Dynamic regexps as operands of the @samp{~} and @samp{!~} operators
(@pxref{Regexp Usage, ,How to Use Regular Expressions}).

@item
The escape sequences @samp{\b}, @samp{\f}, and @samp{\r}
(@pxref{Escape Sequences}).
(Some vendors have updated their old versions of @code{awk} to
recognize @samp{\r}, @samp{\b}, and @samp{\f}, but this is not
something you can rely on.)

@item
Redirection of input for the @code{getline} function
(@pxref{Getline, ,Explicit Input with @code{getline}}).

@item
Multiple @code{BEGIN} and @code{END} rules
(@pxref{BEGIN/END, ,The @code{BEGIN} and @code{END} Special Patterns}).

@item
Multi-dimensional arrays
(@pxref{Multi-dimensional, ,Multi-dimensional Arrays}).
@end itemize

@node SVR4, POSIX, V7/SVR3.1, Language History
@section Changes between SVR3.1 and SVR4

@cindex @code{awk} language, V.4 version
The System V Release 4 version of Unix @code{awk} added these features
(some of which originated in @code{gawk}):

@itemize @bullet
@item
The @code{ENVIRON} variable (@pxref{Built-in Variables}).

@item
Multiple @samp{-f} options on the command line
(@pxref{Options, ,Command Line Options}).

@item
The @samp{-v} option for assigning variables before program execution begins
(@pxref{Options, ,Command Line Options}).

@item
The @samp{--} option for terminating command line options.

@item
The @samp{\a}, @samp{\v}, and @samp{\x} escape sequences
(@pxref{Escape Sequences}).

@item
A defined return value for the @code{srand} built-in function
(@pxref{Numeric Functions, ,Numeric Built-in Functions}).

@item
The @code{toupper} and @code{tolower} built-in string functions
for case translation
(@pxref{String Functions, ,Built-in Functions for String Manipulation}).

@item
A cleaner specification for the @samp{%c} format-control letter in the
@code{printf} function
(@pxref{Control Letters, ,Format-Control Letters}).

@item
The ability to dynamically pass the field width and precision (@code{"%*.*d"})
in the argument list of the @code{printf} function
(@pxref{Control Letters, ,Format-Control Letters}).

@item
The use of regexp constants such as @code{/foo/} as expressions, where
they are equivalent to using the matching operator, as in @samp{$0 ~ /foo/}
(@pxref{Using Constant Regexps, ,Using Regular Expression Constants}).
@end itemize

@node POSIX, BTL, SVR4, Language History
@section Changes between SVR4 and POSIX @code{awk}

The POSIX Command Language and Utilities standard for @code{awk}
introduced the following changes into the language:

@itemize @bullet
@item
The use of @samp{-W} for implementation-specific options.

@item
The use of @code{CONVFMT} for controlling the conversion of numbers
to strings (@pxref{Conversion, ,Conversion of Strings and Numbers}).

@item
The concept of a numeric string, and tighter comparison rules to go
with it (@pxref{Typing and Comparison, ,Variable Typing and Comparison Expressions}).

@item
More complete documentation of many of the previously undocumented
features of the language.
@end itemize

The following common extensions are not permitted by the POSIX
standard:

@c IMPORTANT! Keep this list in sync with the one in node Options

@itemize @bullet
@item
@code{\x} escape sequences are not recognized
(@pxref{Escape Sequences}).

@item
Newlines do not act as whitespace to separate fields when @code{FS} is
equal to a single space.

@item
The synonym @code{func} for the keyword @code{function} is not
recognized (@pxref{Definition Syntax, ,Function Definition Syntax}).

@item
The operators @samp{**} and @samp{**=} cannot be used in
place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops, ,Arithmetic Operators},
and also @pxref{Assignment Ops, ,Assignment Expressions}).

@item
Specifying @samp{-Ft} on the command line does not set the value
of @code{FS} to be a single tab character
(@pxref{Field Separators, ,Specifying How Fields are Separated}).

@item
The @code{fflush} built-in function is not supported
(@pxref{I/O Functions, , Built-in Functions for Input/Output}).
@end itemize

@node BTL, POSIX/GNU, POSIX, Language History
@section Extensions in the Bell Laboratories @code{awk}

@cindex Kernighan, Brian
Brian Kernighan, one of the original designers of Unix @code{awk},
has made his version available via anonymous @code{ftp}
(@pxref{Other Versions, ,Other Freely Available @code{awk} Implementations}).
This section describes extensions in his version of @code{awk} that are
not in POSIX @code{awk}.

@itemize @bullet
@item
The @samp{-mf @var{NNN}} and @samp{-mr @var{NNN}} command line options
to set the maximum number of fields, and the maximum
record size, respectively
(@pxref{Options, ,Command Line Options}).

@item
The @code{fflush} built-in function for flushing buffered output
(@pxref{I/O Functions, ,Built-in Functions for Input/Output}).

@ignore
@item
The @code{SYMTAB} array, that allows access to the internal symbol
table of @code{awk}. This feature is not documented, largely because
it is somewhat shakily implemented. For instance, you cannot access arrays
or array elements through it.
@end ignore
@end itemize

@node POSIX/GNU, , BTL, Language History
@section Extensions in @code{gawk} Not in POSIX @code{awk}

@cindex compatibility mode
The GNU implementation, @code{gawk}, adds a number of features.
This section lists them in the order they were added to @code{gawk}.
They can all be disabled with either the @samp{--traditional} or
@samp{--posix} options
(@pxref{Options, ,Command Line Options}).

Version 2.10 of @code{gawk} introduced these features:

@itemize @bullet
@item
The @code{AWKPATH} environment variable for specifying a path search for
the @samp{-f} command line option
(@pxref{Options, ,Command Line Options}).

@item
The @code{IGNORECASE} variable and its effects
(@pxref{Case-sensitivity, ,Case-sensitivity in Matching}).

@item
The @file{/dev/stdin}, @file{/dev/stdout}, @file{/dev/stderr}, and
@file{/dev/fd/@var{n}} file name interpretation
(@pxref{Special Files, ,Special File Names in @code{gawk}}).
@end itemize

Version 2.13 of @code{gawk} introduced these features:

@itemize @bullet
@item
The @code{FIELDWIDTHS} variable and its effects
(@pxref{Constant Size, ,Reading Fixed-width Data}).

@item
The @code{systime} and @code{strftime} built-in functions for obtaining
and printing time stamps
(@pxref{Time Functions, ,Functions for Dealing with Time Stamps}).

@item
The @samp{-W lint} option to provide source code and run time error
and portability checking
(@pxref{Options, ,Command Line Options}).

@item
The @samp{-W compat} option to turn off these extensions
(@pxref{Options, ,Command Line Options}).

@item
The @samp{-W posix} option for full POSIX compliance
(@pxref{Options, ,Command Line Options}).
@end itemize

Version 2.14 of @code{gawk} introduced these features:

@itemize @bullet
@item
The @code{next file} statement for skipping to the next data file
(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).
@end itemize

Version 2.15 of @code{gawk} introduced these features:

@itemize @bullet
@item
The @code{ARGIND} variable, which tracks the movement of @code{FILENAME}
through @code{ARGV} (@pxref{Built-in Variables}).

@item
The @code{ERRNO} variable, which contains the system error message when
@code{getline} returns @minus{}1, or when @code{close} fails
(@pxref{Built-in Variables}).

@item
The ability to use GNU-style long named options that start with @samp{--}
(@pxref{Options, ,Command Line Options}).

@item
The @samp{--source} option for mixing command line and library
file source code
(@pxref{Options, ,Command Line Options}).

@item
The @file{/dev/pid}, @file{/dev/ppid}, @file{/dev/pgrpid}, and
@file{/dev/user} file name interpretation
(@pxref{Special Files, ,Special File Names in @code{gawk}}).
@end itemize

Version 3.0 of @code{gawk} introduced these features:

@itemize @bullet
@item
The @code{next file} statement became @code{nextfile}
(@pxref{Nextfile Statement, ,The @code{nextfile} Statement}).

@item
The @samp{--lint-old} option to
warn about constructs that are not available in
the original Version 7 Unix version of @code{awk}
(@pxref{V7/SVR3.1, , Major Changes between V7 and SVR3.1}).

@item
The @samp{--traditional} option was added as a better name for
@samp{--compat} (@pxref{Options, ,Command Line Options}).

@item
The ability for @code{FS} to be a null string, and for the third
argument to @code{split} to be the null string
(@pxref{Single Character Fields, , Making Each Character a Separate Field}).

@item
The ability for @code{RS} to be a regexp
(@pxref{Records, , How Input is Split into Records}).

@item
The @code{RT} variable
(@pxref{Records, , How Input is Split into Records}).

@item
The @code{gensub} function for more powerful text manipulation
(@pxref{String Functions, , Built-in Functions for String Manipulation}).

@item
The @code{strftime} function acquired a default time format,
allowing it to be called with no arguments
(@pxref{Time Functions, , Functions for Dealing with Time Stamps}).

@item
Full support for both POSIX and GNU regexps
(@pxref{Regexp, , Regular Expressions}).

@item
The @samp{--re-interval} option to provide interval expressions in regexps
(@pxref{Regexp Operators, , Regular Expression Operators}).

@item
@code{IGNORECASE} changed, now applying to string comparison as well
as regexp operations
(@pxref{Case-sensitivity, ,Case-sensitivity in Matching}).

@item
The @samp{-m} option and the @code{fflush} function from the
Bell Labs research version of @code{awk}
(@pxref{Options, ,Command Line Options}; also
@pxref{I/O Functions, ,Built-in Functions for Input/Output}).

@item
The use of GNU Autoconf to control the configuration process
(@pxref{Quick Installation, , Compiling @code{gawk} for Unix}).

@item
Amiga support
(@pxref{Amiga Installation, ,Installing @code{gawk} on an Amiga}).

@c XXX ADD MORE STUFF HERE

@end itemize

@node Gawk Summary, Installation, Language History, Top
@appendix @code{gawk} Summary

This appendix provides a brief summary of the @code{gawk} command line and the
@code{awk} language. It is designed to serve as a ``quick reference.'' It is
therefore terse, but complete.

@menu
* Command Line Summary::        Recapitulation of the command line.
* Language Summary::            A terse review of the language.
* Variables/Fields::            Variables, fields, and arrays.
* Rules Summary::               Patterns and Actions, and their component
                                parts.
* Actions Summary::             Quick overview of actions.
* Functions Summary::           Defining and calling functions.
* Historical Features::         Some undocumented but supported ``features''.
@end menu

@node Command Line Summary, Language Summary, Gawk Summary, Gawk Summary
@appendixsec Command Line Options Summary

The command line consists of options to @code{gawk} itself, the
@code{awk} program text (if not supplied via the @samp{-f} option), and
values to be made available in the @code{ARGC} and @code{ARGV}
predefined @code{awk} variables:

@example
gawk @r{[@var{POSIX or GNU style options}]} -f @var{source-file} @r{[@code{--}]} @var{file} @dots{}
gawk @r{[@var{POSIX or GNU style options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{}
@end example

The options that @code{gawk} accepts are:

@table @code
@item -F @var{fs}
@itemx --field-separator @var{fs}
Use @var{fs} for the input field separator (the value of the @code{FS}
predefined variable).

@item -f @var{program-file}
@itemx --file @var{program-file}
Read the @code{awk} program source from the file @var{program-file}, instead
of from the first command line argument.

@item -mf @var{NNN}
@itemx -mr @var{NNN}
The @samp{f} flag sets
the maximum number of fields, and the @samp{r} flag sets the maximum
record size. These options are ignored by @code{gawk}, since @code{gawk}
has no predefined limits; they are only for compatibility with the
Bell Labs research version of Unix @code{awk}.

@item -v @var{var}=@var{val}
@itemx --assign @var{var}=@var{val}
Assign the variable @var{var} the value @var{val} before program execution
begins.

@item -W traditional
@itemx -W compat
@itemx --traditional
@itemx --compat
Use compatibility mode, in which @code{gawk} extensions are turned
off.

@item -W copyleft
@itemx -W copyright
@itemx --copyleft
@itemx --copyright
Print the short version of the General Public License on the standard
output, and exit. This option may disappear in a future version of @code{gawk}.

@item -W help
@itemx -W usage
@itemx --help
@itemx --usage
Print a relatively short summary of the available options on the standard
output, and exit.

@item -W lint
@itemx --lint
Give warnings about dubious or non-portable @code{awk} constructs.

@item -W lint-old
@itemx --lint-old
Warn about constructs that are not available in
the original Version 7 Unix version of @code{awk}.

@item -W posix
@itemx --posix
Use POSIX compatibility mode, in which @code{gawk} extensions
are turned off and additional restrictions apply.

@item -W re-interval
@itemx --re-interval
Allow interval expressions
(@pxref{Regexp Operators, , Regular Expression Operators}),
in regexps.

@item -W source=@var{program-text}
@itemx --source @var{program-text}
Use @var{program-text} as @code{awk} program source code. This option allows
mixing command line source code with source code from files, and is
particularly useful for mixing command line programs with library functions.

@item -W version
@itemx --version
Print version information for this particular copy of @code{gawk} on the error
output.

@item --
Signal the end of options. This is useful to allow further arguments to the
@code{awk} program itself to start with a @samp{-}. This is mainly for
consistency with POSIX argument parsing conventions.
@end table

Any other options are flagged as invalid, but are otherwise ignored.
@xref{Options, ,Command Line Options}, for more details.
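
As a brief illustration of the portable options above, the following
sketch combines @samp{-F}, @samp{-v}, and @samp{--} in one command.
The sample input and the variable name @code{min} are made up for this
example, and plain @code{awk} is used on the assumption that any POSIX
@code{awk}, including @code{gawk}, is installed under that name.

@example
# print the first field of each record whose third
# colon-separated field is at least the value of `min'
printf 'root:x:0\narnold:x:2076\n' |
    awk -F: -v min=1000 -- '$3 >= min @{ print $1 @}'
@end example

@noindent
With this input, only the second record passes the test, so the
command should print @samp{arnold}.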

@node Language Summary, Variables/Fields, Command Line Summary, Gawk Summary
@appendixsec Language Summary

An @code{awk} program consists of a sequence of zero or more pattern-action
statements and optional function definitions. One or the other of the
pattern and action may be omitted.

@example
@var{pattern}    @{ @var{action statements} @}
@var{pattern}
           @{ @var{action statements} @}

function @var{name}(@var{parameter list})     @{ @var{action statements} @}
@end example

@code{gawk} first reads the program source from the
@var{program-file}(s), if specified, or from the first non-option
argument on the command line. The @samp{-f} option may be used multiple
times on the command line. @code{gawk} reads the program text from all
the @var{program-file} files, effectively concatenating them in the
order they are specified. This is useful for building libraries of
@code{awk} functions, without having to include them in each new
@code{awk} program that uses them. To use a library function in a file
from a program typed in on the command line, specify
@samp{--source '@var{program}'}, and type your program in between the single
quotes.
@xref{Options, ,Command Line Options}.

The environment variable @code{AWKPATH} specifies a search path to use
when finding source files named with the @samp{-f} option. The default
path, which is
@samp{.:/usr/local/share/awk}@footnote{The path may use a directory
other than @file{/usr/local/share/awk}, depending upon how @code{gawk}
was built and installed.} is used if @code{AWKPATH} is not set.
If a file name given to the @samp{-f} option contains a @samp{/} character,
no path search is performed.
@xref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.
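
The following sketch shows how several @samp{-f} files are treated as
one program. The file names @file{lib.awk} and @file{main.awk} and the
function @code{double} are hypothetical, chosen only for illustration;
plain @code{awk} is used on the assumption that the installed version
accepts multiple @samp{-f} options.

@example
# lib.awk holds a library function; main.awk calls it
cat > lib.awk <<'EOF'
function double(x) @{ return 2 * x @}
EOF
cat > main.awk <<'EOF'
BEGIN @{ print double(21) @}
EOF

# both files are read as if they were concatenated
awk -f lib.awk -f main.awk
@end example

@noindent
This should print @samp{42}. With @code{gawk}, the second file could
instead be given on the command line as
@samp{--source 'BEGIN @{ print double(21) @}'}.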

@code{gawk} compiles the program into an internal form, and then proceeds to
read each file named in the @code{ARGV} array.
The initial values of @code{ARGV} come from the command line arguments.
If there are no files named
on the command line, @code{gawk} reads the standard input.

If a ``file'' named on the command line has the form
@samp{@var{var}=@var{val}}, it is treated as a variable assignment: the
variable @var{var} is assigned the value @var{val}.
If any of the files have a value that is the null string, that
element in the list is skipped.

For each record in the input, @code{gawk} tests to see if it matches any
@var{pattern} in the @code{awk} program. For each pattern that the record
matches, the associated @var{action} is executed.

@node Variables/Fields, Rules Summary, Language Summary, Gawk Summary
@appendixsec Variables and Fields

@code{awk} variables are not declared; they come into existence when they are
first used. Their values are either floating-point numbers or strings.
@code{awk} also has one-dimensional arrays; multi-dimensional arrays
may be simulated. There are several predefined variables that
@code{awk} sets as a program runs; these are summarized below.

@menu
* Fields Summary::              Input field splitting.
* Built-in Summary::            @code{awk}'s built-in variables.
* Arrays Summary::              Using arrays.
* Data Type Summary::           Values in @code{awk} are numbers or strings.
@end menu

@node Fields Summary, Built-in Summary, Variables/Fields, Variables/Fields
@appendixsubsec Fields

As each input line is read, @code{gawk} splits the line into
@var{fields}, using the value of the @code{FS} variable as the field
separator. If @code{FS} is a single character, fields are separated by
that character.
Otherwise, @code{FS} is expected to be a full regular
expression. In the special case that @code{FS} is a single space,
fields are separated by runs of spaces, tabs and/or newlines.@footnote{In
POSIX @code{awk}, newline does not separate fields.}
If @code{FS} is the null string (@code{""}), then each individual
character in the record becomes a separate field.
Note that the value
of @code{IGNORECASE} (@pxref{Case-sensitivity, ,Case-sensitivity in Matching})
also affects how fields are split when @code{FS} is a regular expression.

Each field in the input line may be referenced by its position, @code{$1},
@code{$2}, and so on. @code{$0} is the whole line. The value of a field may
be assigned to as well. Field numbers need not be constants:

@example
n = 5
print $n
@end example

@noindent
prints the fifth field in the input line. The variable @code{NF} is set to
the total number of fields in the input line.

References to non-existent fields (i.e.@: fields after @code{$NF}) return
the null string. However, assigning to a non-existent field (e.g.,
@code{$(NF+2) = 5}) increases the value of @code{NF}, creates any
intervening fields with the null string as their value, and causes the
value of @code{$0} to be recomputed, with the fields being separated by
the value of @code{OFS}.
Decrementing @code{NF} causes the values of fields past the new value to
be lost, and the value of @code{$0} to be recomputed, with the fields being
separated by the value of @code{OFS}.
@xref{Reading Files, ,Reading Input Files}.

@node Built-in Summary, Arrays Summary, Fields Summary, Variables/Fields
@appendixsubsec Built-in Variables

@code{gawk}'s built-in variables are:

@table @code
@item ARGC
The number of elements in @code{ARGV}.
See below for what is actually
included in @code{ARGV}.

@item ARGIND
The index in @code{ARGV} of the current file being processed.
When @code{gawk} is processing the input data files,
it is always true that @samp{FILENAME == ARGV[ARGIND]}.

@item ARGV
The array of command line arguments. The array is indexed from zero to
@code{ARGC} @minus{} 1. Dynamically changing @code{ARGC} and
the contents of @code{ARGV}
can control the files used for data. A null-valued element in
@code{ARGV} is ignored. @code{ARGV} does not include the options to
@code{awk} or the text of the @code{awk} program itself.

@item CONVFMT
The conversion format to use when converting numbers to strings.

@item FIELDWIDTHS
A space separated list of numbers describing the fixed-width input data.

@item ENVIRON
An array of environment variable values. The array
is indexed by variable name, each element being the value of that
variable. Thus, the environment variable @code{HOME} is
@code{ENVIRON["HOME"]}. One possible value might be @file{/home/arnold}.

Changing this array does not affect the environment seen by programs
which @code{gawk} spawns via redirection or the @code{system} function.
(This may change in a future version of @code{gawk}.)

Some operating systems do not have environment variables.
The @code{ENVIRON} array is empty when running on these systems.

@item ERRNO
The system error message when an error occurs using @code{getline}
or @code{close}.

@item FILENAME
The name of the current input file. If no files are specified on the command
line, the value of @code{FILENAME} is the null string.

@item FNR
The input record number in the current input file.

@item FS
The input field separator, a space by default.

@item IGNORECASE
The case-sensitivity flag for string comparisons and regular expression
operations. If @code{IGNORECASE} has a non-zero value, then pattern
matching in rules, record separating with @code{RS}, field splitting
with @code{FS}, regular expression matching with @samp{~} and
@samp{!~}, and the @code{gensub}, @code{gsub}, @code{index},
@code{match}, @code{split} and @code{sub} built-in functions all
ignore case when doing regular expression operations, and all string
comparisons are done ignoring case.
The value of @code{IGNORECASE} does @emph{not} affect array subscripting.

@item NF
The number of fields in the current input record.

@item NR
The total number of input records seen so far.

@item OFMT
The output format for numbers for the @code{print} statement,
@code{"%.6g"} by default.

@item OFS
The output field separator, a space by default.

@item ORS
The output record separator, by default a newline.

@item RS
The input record separator, by default a newline.
If @code{RS} is set to the null string, then records are separated by
blank lines. When @code{RS} is set to the null string, then the newline
character always acts as a field separator, in addition to whatever value
@code{FS} may have. If @code{RS} is set to a multi-character
string, it denotes a regexp; input text matching the regexp
separates records.

@item RT
The input text that matched the text denoted by @code{RS},
the record separator.

@item RSTART
The index of the first character last matched by @code{match}; zero if no match.

@item RLENGTH
The length of the string last matched by @code{match}; @minus{}1 if no match.

@item SUBSEP
The string used to separate multiple subscripts in array elements, by
default @code{"\034"}.
17663@end table 17664 17665@xref{Built-in Variables}, for more information. 17666 17667@node Arrays Summary, Data Type Summary, Built-in Summary, Variables/Fields 17668@appendixsubsec Arrays 17669 17670Arrays are subscripted with an expression between square brackets 17671(@samp{[} and @samp{]}). Array subscripts are @emph{always} strings; 17672numbers are converted to strings as necessary, following the standard 17673conversion rules 17674(@pxref{Conversion, ,Conversion of Strings and Numbers}). 17675 17676If you use multiple expressions separated by commas inside the square 17677brackets, then the array subscript is a string consisting of the 17678concatenation of the individual subscript values, converted to strings, 17679separated by the subscript separator (the value of @code{SUBSEP}). 17680 17681The special operator @code{in} may be used in a conditional context 17682to see if an array has an index consisting of a particular value. 17683 17684@example 17685if (val in array) 17686 print array[val] 17687@end example 17688 17689If the array has multiple subscripts, use @samp{(i, j, @dots{}) in @var{array}} 17690to test for existence of an element. 17691 17692The @code{in} construct may also be used in a @code{for} loop to iterate 17693over all the elements of an array. 17694@xref{Scanning an Array, ,Scanning All Elements of an Array}. 17695 17696You can remove an element from an array using the @code{delete} statement. 17697 17698You can clear an entire array using @samp{delete @var{array}}. 17699 17700@xref{Arrays, ,Arrays in @code{awk}}. 17701 17702@node Data Type Summary, , Arrays Summary, Variables/Fields 17703@appendixsubsec Data Types 17704 17705The value of an @code{awk} expression is always either a number 17706or a string. 17707 17708Some contexts (such as arithmetic operators) require numeric 17709values. They convert strings to numbers by interpreting the text 17710of the string as a number. 
If the string does not look like a 17711number, it converts to zero. 17712 17713Other contexts (such as concatenation) require string values. 17714They convert numbers to strings by effectively printing them 17715with @code{sprintf}. 17716@xref{Conversion, ,Conversion of Strings and Numbers}, for the details. 17717 17718To force conversion of a string value to a number, simply add zero 17719to it. If the value you start with is already a number, this 17720does not change it. 17721 17722To force conversion of a numeric value to a string, concatenate it with 17723the null string. 17724 17725Comparisons are done numerically if both operands are numeric, or if 17726one is numeric and the other is a numeric string. Otherwise one or 17727both operands are converted to strings and a string comparison is 17728performed. Fields, @code{getline} input, @code{FILENAME}, @code{ARGV} 17729elements, @code{ENVIRON} elements and the elements of an array created 17730by @code{split} are the only items that can be numeric strings. String 17731constants, such as @code{"3.1415927"} are not numeric strings, they are 17732string constants. The full rules for comparisons are described in 17733@ref{Typing and Comparison, ,Variable Typing and Comparison Expressions}. 17734 17735Uninitialized variables have the string value @code{""} (the null, or 17736empty, string). In contexts where a number is required, this is 17737equivalent to zero. 17738 17739@xref{Variables}, for more information on variable naming and initialization; 17740@pxref{Conversion, ,Conversion of Strings and Numbers}, for more information 17741on how variable values are interpreted. 17742 17743@node Rules Summary, Actions Summary, Variables/Fields, Gawk Summary 17744@appendixsec Patterns 17745 17746@menu 17747* Pattern Summary:: Quick overview of patterns. 17748* Regexp Summary:: Quick overview of regular expressions. 
17749@end menu 17750 17751An @code{awk} program is mostly composed of rules, each consisting of a 17752pattern followed by an action. The action is enclosed in @samp{@{} and 17753@samp{@}}. Either the pattern may be missing, or the action may be 17754missing, but not both. If the pattern is missing, the 17755action is executed for every input record. A missing action is 17756equivalent to @samp{@w{@{ print @}}}, which prints the entire line. 17757 17758@c These paragraphs repeated for both patterns and actions. I don't 17759@c like this, but I also don't see any way around it. Update both copies 17760@c if they need fixing. 17761Comments begin with the @samp{#} character, and continue until the end of the 17762line. Blank lines may be used to separate statements. Statements normally 17763end with a newline; however, this is not the case for lines ending in a 17764@samp{,}, @samp{@{}, @samp{?}, @samp{:}, @samp{&&}, or @samp{||}. Lines 17765ending in @code{do} or @code{else} also have their statements automatically 17766continued on the following line. In other cases, a line can be continued by 17767ending it with a @samp{\}, in which case the newline is ignored. 17768 17769Multiple statements may be put on one line by separating each one with 17770a @samp{;}. 17771This applies to both the statements within the action part of a rule (the 17772usual case), and to the rule statements. 17773 17774@xref{Comments, ,Comments in @code{awk} Programs}, for information on 17775@code{awk}'s commenting convention; 17776@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a 17777description of the line continuation mechanism in @code{awk}. 
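
As a small illustrative sketch of these conventions (the pattern
@samp{/foo/} and the variable names are invented for this example, and
the input is assumed to have numeric second and third fields), the
following program shows a rule with no action, a rule with no pattern,
automatic continuation after @samp{&&}, explicit continuation with
@samp{\}, and two statements on one line:

@example
# Pattern with no action: print each record that contains "foo".
/foo/

# Action with no pattern: executed for every input record.
@{ count++ @}

# A line ending in `&&' is continued automatically; the backslash
# continues a line that would otherwise end the statement.
NF > 2 &&
$1 != "" @{
    total = $2 + \
            $3
    n++; sum += total    # two statements separated by `;'
@}

END @{ print count, n, sum @}
@end example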
17778 17779@node Pattern Summary, Regexp Summary, Rules Summary, Rules Summary 17780@appendixsubsec Pattern Summary 17781 17782@code{awk} patterns may be one of the following: 17783 17784@example 17785/@var{regular expression}/ 17786@var{relational expression} 17787@var{pattern} && @var{pattern} 17788@var{pattern} || @var{pattern} 17789@var{pattern} ? @var{pattern} : @var{pattern} 17790(@var{pattern}) 17791! @var{pattern} 17792@var{pattern1}, @var{pattern2} 17793BEGIN 17794END 17795@end example 17796 17797@code{BEGIN} and @code{END} are two special kinds of patterns that are not 17798tested against the input. The action parts of all @code{BEGIN} rules are 17799concatenated as if all the statements had been written in a single @code{BEGIN} 17800rule. They are executed before any of the input is read. Similarly, all the 17801@code{END} rules are concatenated, and executed when all the input is exhausted (or 17802when an @code{exit} statement is executed). @code{BEGIN} and @code{END} 17803patterns cannot be combined with other patterns in pattern expressions. 17804@code{BEGIN} and @code{END} rules cannot have missing action parts. 17805 17806For @code{/@var{regular-expression}/} patterns, the associated statement is 17807executed for each input record that matches the regular expression. Regular 17808expressions are summarized below. 17809 17810A @var{relational expression} may use any of the operators defined below in 17811the section on actions. These generally test whether certain fields match 17812certain regular expressions. 17813 17814The @samp{&&}, @samp{||}, and @samp{!} operators are logical ``and,'' 17815logical ``or,'' and logical ``not,'' respectively, as in C. They do 17816short-circuit evaluation, also as in C, and are used for combining more 17817primitive pattern expressions. As in most languages, parentheses may be 17818used to change the order of evaluation. 17819 17820The @samp{?:} operator is like the same operator in C. 
If the first 17821pattern matches, then the second pattern is matched against the input 17822record; otherwise, the third is matched. Only one of the second and 17823third patterns is matched. 17824 17825The @samp{@var{pattern1}, @var{pattern2}} form of a pattern is called a 17826range pattern. It matches all input lines starting with a line that 17827matches @var{pattern1}, and continuing until a line that matches 17828@var{pattern2}, inclusive. A range pattern cannot be used as an operand 17829of any of the pattern operators. 17830 17831@xref{Pattern Overview, ,Pattern Elements}. 17832 17833@node Regexp Summary, , Pattern Summary, Rules Summary 17834@appendixsubsec Regular Expressions 17835 17836Regular expressions are based on POSIX EREs (extended regular expressions). 17837The escape sequences allowed in string constants are also valid in 17838regular expressions (@pxref{Escape Sequences}). 17839Regexps are composed of characters as follows: 17840 17841@table @code 17842@item @var{c} 17843matches the character @var{c} (assuming @var{c} is none of the characters 17844listed below). 17845 17846@item \@var{c} 17847matches the literal character @var{c}. 17848 17849@item . 17850matches any character, @emph{including} newline. 17851In strict POSIX mode, @samp{.} does not match the @sc{nul} 17852character, which is a character with all bits equal to zero. 17853 17854@item ^ 17855matches the beginning of a string. 17856 17857@item $ 17858matches the end of a string. 17859 17860@item [@var{abc}@dots{}] 17861matches any of the characters @var{abc}@dots{} (character list). 17862 17863@item [[:@var{class}:]] 17864matches any character in the character class @var{class}. Allowable classes 17865are @code{alnum}, @code{alpha}, @code{blank}, @code{cntrl}, 17866@code{digit}, @code{graph}, @code{lower}, @code{print}, @code{punct}, 17867@code{space}, @code{upper}, and @code{xdigit}. 
17868 17869@item [[.@var{symbol}.]] 17870matches the multi-character collating symbol @var{symbol}. 17871@code{gawk} does not currently support collating symbols. 17872 17873@item [[=@var{classname}=]] 17874matches any of the equivalent characters in the current locale named by the 17875equivalence class @var{classname}. 17876@code{gawk} does not currently support equivalence classes. 17877 17878@item [^@var{abc}@dots{}] 17879matches any character except @var{abc}@dots{} (negated 17880character list). 17881 17882@item @var{r1}|@var{r2} 17883matches either @var{r1} or @var{r2} (alternation). 17884 17885@item @var{r1r2} 17886matches @var{r1}, and then @var{r2} (concatenation). 17887 17888@item @var{r}+ 17889matches one or more @var{r}'s. 17890 17891@item @var{r}* 17892matches zero or more @var{r}'s. 17893 17894@item @var{r}? 17895matches zero or one @var{r}'s. 17896 17897@item (@var{r}) 17898matches @var{r} (grouping). 17899 17900@item @var{r}@{@var{n}@} 17901@itemx @var{r}@{@var{n},@} 17902@itemx @var{r}@{@var{n},@var{m}@} 17903matches at least @var{n}, @var{n} to any number, or @var{n} to @var{m} 17904occurrences of @var{r} (interval expressions). 17905 17906@item \y 17907matches the empty string at either the beginning or the 17908end of a word. 17909 17910@item \B 17911matches the empty string within a word. 17912 17913@item \< 17914matches the empty string at the beginning of a word. 17915 17916@item \> 17917matches the empty string at the end of a word. 17918 17919@item \w 17920matches any word-constituent character (alphanumeric characters and 17921the underscore). 17922 17923@item \W 17924matches any character that is not word-constituent. 17925 17926@item \` 17927matches the empty string at the beginning of a buffer (same as a string 17928in @code{gawk}). 17929 17930@item \' 17931matches the empty string at the end of a buffer. 17932@end table 17933 17934The various command line options 17935control how @code{gawk} interprets characters in regexps. 

@c NOTE!!! Keep this in sync with the same table in the regexp chapter!
@table @asis
@item No options
In the default case, @code{gawk} provides all the facilities of
POSIX regexps and the GNU regexp operators described above.
However, interval expressions are not supported.

@item @code{--posix}
Only POSIX regexps are supported; the GNU operators are not special
(e.g., @samp{\w} matches a literal @samp{w}). Interval expressions
are allowed.

@item @code{--traditional}
Traditional Unix @code{awk} regexps are matched. The GNU operators
are not special, interval expressions are not available, and neither
are the POSIX character classes (@code{[[:alnum:]]} and so on).
Characters described by octal and hexadecimal escape sequences are
treated literally, even if they represent regexp metacharacters.

@item @code{--re-interval}
Allow interval expressions in regexps, even if @samp{--traditional}
has been provided.
@end table

@xref{Regexp, ,Regular Expressions}.

@node Actions Summary, Functions Summary, Rules Summary, Gawk Summary
@appendixsec Actions

Action statements are enclosed in braces, @samp{@{} and @samp{@}}.
A missing action statement is equivalent to @samp{@w{@{ print @}}}.

Action statements consist of the usual assignment, conditional, and looping
statements found in most languages. The operators, control statements,
and Input/Output statements available are similar to those in C.

@c These paragraphs repeated for both patterns and actions. I don't
@c like this, but I also don't see any way around it. Update both copies
@c if they need fixing.
Comments begin with the @samp{#} character, and continue until the end of the
line. Blank lines may be used to separate statements.
Statements normally 17978end with a newline; however, this is not the case for lines ending in a 17979@samp{,}, @samp{@{}, @samp{?}, @samp{:}, @samp{&&}, or @samp{||}. Lines 17980ending in @code{do} or @code{else} also have their statements automatically 17981continued on the following line. In other cases, a line can be continued by 17982ending it with a @samp{\}, in which case the newline is ignored. 17983 17984Multiple statements may be put on one line by separating each one with 17985a @samp{;}. 17986This applies to both the statements within the action part of a rule (the 17987usual case), and to the rule statements. 17988 17989@xref{Comments, ,Comments in @code{awk} Programs}, for information on 17990@code{awk}'s commenting convention; 17991@pxref{Statements/Lines, ,@code{awk} Statements Versus Lines}, for a 17992description of the line continuation mechanism in @code{awk}. 17993 17994@menu 17995* Operator Summary:: @code{awk} operators. 17996* Control Flow Summary:: The control statements. 17997* I/O Summary:: The I/O statements. 17998* Printf Summary:: A summary of @code{printf}. 17999* Special File Summary:: Special file names interpreted internally. 18000* Built-in Functions Summary:: Built-in numeric and string functions. 18001* Time Functions Summary:: Built-in time functions. 18002* String Constants Summary:: Escape sequences in strings. 18003@end menu 18004 18005@node Operator Summary, Control Flow Summary, Actions Summary, Actions Summary 18006@appendixsubsec Operators 18007 18008The operators in @code{awk}, in order of decreasing precedence, are: 18009 18010@table @code 18011@item (@dots{}) 18012Grouping. 18013 18014@item $ 18015Field reference. 18016 18017@item ++ -- 18018Increment and decrement, both prefix and postfix. 18019 18020@item ^ 18021Exponentiation (@samp{**} may also be used, and @samp{**=} for the assignment 18022operator, but they are not specified in the POSIX standard). 18023 18024@item + - ! 
18025Unary plus, unary minus, and logical negation. 18026 18027@item * / % 18028Multiplication, division, and modulus. 18029 18030@item + - 18031Addition and subtraction. 18032 18033@item @var{space} 18034String concatenation. 18035 18036@item < <= > >= != == 18037The usual relational operators. 18038 18039@item ~ !~ 18040Regular expression match, negated match. 18041 18042@item in 18043Array membership. 18044 18045@item && 18046Logical ``and''. 18047 18048@item || 18049Logical ``or''. 18050 18051@item ?: 18052A conditional expression. This has the form @samp{@var{expr1} ? 18053@var{expr2} : @var{expr3}}. If @var{expr1} is true, the value of the 18054expression is @var{expr2}; otherwise it is @var{expr3}. Only one of 18055@var{expr2} and @var{expr3} is evaluated. 18056 18057@item = += -= *= /= %= ^= 18058Assignment. Both absolute assignment (@code{@var{var}=@var{value}}) 18059and operator assignment (the other forms) are supported. 18060@end table 18061 18062@xref{Expressions}. 18063 18064@node Control Flow Summary, I/O Summary, Operator Summary, Actions Summary 18065@appendixsubsec Control Statements 18066 18067The control statements are as follows: 18068 18069@example 18070if (@var{condition}) @var{statement} @r{[} else @var{statement} @r{]} 18071while (@var{condition}) @var{statement} 18072do @var{statement} while (@var{condition}) 18073for (@var{expr1}; @var{expr2}; @var{expr3}) @var{statement} 18074for (@var{var} in @var{array}) @var{statement} 18075break 18076continue 18077delete @var{array}[@var{index}] 18078delete @var{array} 18079exit @r{[} @var{expression} @r{]} 18080@{ @var{statements} @} 18081@end example 18082 18083@xref{Statements, ,Control Statements in Actions}. 
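
To show how several of these statements fit together, here is a short
sketch that counts words and reports those seen more than once. The
array name @code{freq} and the convention that a field beginning with
@samp{#} ends the useful part of a record are assumptions made only for
this example:

@example
@{
    i = 1
    while (i <= NF) @{
        if ($i ~ /^#/)
            break           # ignore the rest of this record
        freq[$i]++
        i++
    @}
@}

END @{
    for (word in freq) @{
        if (freq[word] < 2)
            continue        # skip words seen only once
        print word, freq[word]
    @}
    delete freq             # clear the whole array
@}
@end example

The order in which @samp{for (word in freq)} visits the elements is not
specified, so the report is not sorted.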

@node I/O Summary, Printf Summary, Control Flow Summary, Actions Summary
@appendixsubsec I/O Statements

The Input/Output statements are as follows:

@table @code
@item getline
Set @code{$0} from next input record; set @code{NF}, @code{NR}, @code{FNR}.
@xref{Getline, ,Explicit Input with @code{getline}}.

@item getline <@var{file}
Set @code{$0} from next record of @var{file}; set @code{NF}.

@item getline @var{var}
Set @var{var} from next input record; set @code{NR}, @code{FNR}.

@item getline @var{var} <@var{file}
Set @var{var} from next record of @var{file}.

@item @var{command} | getline
Run @var{command}, piping its output into @code{getline}; sets @code{$0},
@code{NF}, @code{NR}.

@item @var{command} | getline @var{var}
Run @var{command}, piping its output into @code{getline}; sets @var{var}.

@item next
Stop processing the current input record. The next input record is read and
processing starts over with the first pattern in the @code{awk} program.
If the end of the input data is reached, the @code{END} rule(s), if any,
are executed.
@xref{Next Statement, ,The @code{next} Statement}.

@item nextfile
Stop processing the current input file. The next input record read comes
from the next input file. @code{FILENAME} is updated, @code{FNR} is set to one,
@code{ARGIND} is incremented,
and processing starts over with the first pattern in the @code{awk} program.
If the end of the input data is reached, the @code{END} rule(s), if any,
are executed.
Earlier versions of @code{gawk} used @samp{next file}; this usage is still
supported, but is considered to be deprecated.
@xref{Nextfile Statement, ,The @code{nextfile} Statement}.

@item print
Prints the current record.
@xref{Printing, ,Printing Output}.

@item print @var{expr-list}
Prints expressions.

@item print @var{expr-list} > @var{file}
Prints expressions to @var{file}. If @var{file} does not exist, it is
created. If it does exist, its contents are deleted the first time the
@code{print} is executed.

@item print @var{expr-list} >> @var{file}
Prints expressions to @var{file}. The previous contents of @var{file}
are retained, and the output of @code{print} is appended to the file.

@item print @var{expr-list} | @var{command}
Prints expressions, sending the output down a pipe to @var{command}.
The pipeline to the command stays open until the @code{close} function
is called.

@item printf @var{fmt}, @var{expr-list}
Format and print.

@item printf @var{fmt}, @var{expr-list} > @var{file}
Format and print to @var{file}. If @var{file} does not exist, it is
created. If it does exist, its contents are deleted the first time the
@code{printf} is executed.

@item printf @var{fmt}, @var{expr-list} >> @var{file}
Format and print to @var{file}. The previous contents of @var{file}
are retained, and the output of @code{printf} is appended to the file.

@item printf @var{fmt}, @var{expr-list} | @var{command}
Format and print, sending the output down a pipe to @var{command}.
The pipeline to the command stays open until the @code{close} function
is called.
@end table

@code{getline} returns one if it reads a record, zero on end of file,
and @minus{}1 on an error.
In the event of an error, @code{getline} will set @code{ERRNO} to
the value of a system-dependent string that describes the error.

@node Printf Summary, Special File Summary, I/O Summary, Actions Summary
@appendixsubsec @code{printf} Summary

Conversion specifications have the form
@code{%}[@var{flag}][@var{width}][@code{.}@var{prec}]@var{format}.
@c whew!
Items in brackets are optional.

The @code{awk} @code{printf} statement and @code{sprintf} function
accept the following conversion specification formats:

@table @code
@item %c
An ASCII character. If the argument used for @samp{%c} is numeric, it is
treated as a character and printed. Otherwise, the argument is assumed to
be a string, and only the first character of that string is printed.

@item %d
@itemx %i
A decimal number (the integer part).

@item %e
@itemx %E
A floating point number of the form
@samp{@r{[}-@r{]}d.dddddde@r{[}+-@r{]}dd}.
The @samp{%E} format uses @samp{E} instead of @samp{e}.

@item %f
A floating point number of the form
@r{[}@code{-}@r{]}@code{ddd.dddddd}.

@item %g
@itemx %G
Use either the @samp{%e} or @samp{%f} formats, whichever produces a shorter
string, with non-significant zeros suppressed.
@samp{%G} will use @samp{%E} instead of @samp{%e}.

@item %o
An unsigned octal number (also an integer).

@item %u
An unsigned decimal number (again, an integer).

@item %s
A character string.

@item %x
@itemx %X
An unsigned hexadecimal number (an integer).
The @samp{%X} format uses @samp{A} through @samp{F} instead of
@samp{a} through @samp{f} for decimal 10 through 15.

@item %%
A single @samp{%} character; no argument is converted.
@end table

There are optional, additional parameters that may lie between the @samp{%}
and the control letter:

@table @code
@item -
The expression should be left-justified within its field.

@item @var{space}
For numeric conversions, prefix positive values with a space, and
negative values with a minus sign.
18238 18239@item + 18240The plus sign, used before the width modifier (see below), 18241says to always supply a sign for numeric conversions, even if the data 18242to be formatted is positive. The @samp{+} overrides the space modifier. 18243 18244@item # 18245Use an ``alternate form'' for certain control letters. 18246For @samp{o}, supply a leading zero. 18247For @samp{x}, and @samp{X}, supply a leading @samp{0x} or @samp{0X} for 18248a non-zero result. 18249For @samp{e}, @samp{E}, and @samp{f}, the result will always contain a 18250decimal point. 18251For @samp{g}, and @samp{G}, trailing zeros are not removed from the result. 18252 18253@item 0 18254A leading @samp{0} (zero) acts as a flag, that indicates output should be 18255padded with zeros instead of spaces. 18256This applies even to non-numeric output formats. 18257This flag only has an effect when the field width is wider than the 18258value to be printed. 18259 18260@item @var{width} 18261The field should be padded to this width. The field is normally padded 18262with spaces. If the @samp{0} flag has been used, it is padded with zeros. 18263 18264@item .@var{prec} 18265A number that specifies the precision to use when printing. 18266For the @samp{e}, @samp{E}, and @samp{f} formats, this specifies the 18267number of digits you want printed to the right of the decimal point. 18268For the @samp{g}, and @samp{G} formats, it specifies the maximum number 18269of significant digits. For the @samp{d}, @samp{o}, @samp{i}, @samp{u}, 18270@samp{x}, and @samp{X} formats, it specifies the minimum number of 18271digits to print. For the @samp{s} format, it specifies the maximum number of 18272characters from the string that should be printed. 18273@end table 18274 18275Either or both of the @var{width} and @var{prec} values may be specified 18276as @samp{*}. In that case, the particular value is taken from the argument 18277list. 18278 18279@xref{Printf, ,Using @code{printf} Statements for Fancier Printing}. 
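
The following sketch exercises several of the flags and modifiers
summarized above. The values being printed are arbitrary and were chosen
only for illustration:

@example
BEGIN @{
    # left-justify a string in 10 columns; pad a number to 5
    printf "%-10s|%5d|\n", "apple", 42

    # always show the sign, with two digits after the point
    printf "%+.2f\n", 3.14159

    # alternate forms for octal and hexadecimal; zero padding
    printf "%#o %#x %05d\n", 8, 255, 42

    # width (6) and precision (3) taken from the argument list
    printf "%*.*s|\n", 6, 3, "abcdef"
@}
@end example

Because the program consists only of a @code{BEGIN} rule, it can be run
without any input files.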

@node Special File Summary, Built-in Functions Summary, Printf Summary, Actions Summary
@appendixsubsec Special File Names

When doing I/O redirection from either @code{print} or @code{printf} into a
file, or via @code{getline} from a file, @code{gawk} recognizes certain special
file names internally. These file names allow access to open file descriptors
inherited from @code{gawk}'s parent process (usually the shell). The
file names are:

@table @file
@item /dev/stdin
The standard input.

@item /dev/stdout
The standard output.

@item /dev/stderr
The standard error output.

@item /dev/fd/@var{n}
The file denoted by the open file descriptor @var{n}.
@end table

In addition, reading the following files provides process-related information
about the running @code{gawk} program. All returned records are terminated
with a newline.

@table @file
@item /dev/pid
Returns the process ID of the current process.

@item /dev/ppid
Returns the parent process ID of the current process.

@item /dev/pgrpid
Returns the process group ID of the current process.

@item /dev/user
At least four space-separated fields, containing the return values of
the @code{getuid}, @code{geteuid}, @code{getgid}, and @code{getegid}
system calls.
If there are any additional fields, they are the group IDs returned by
the @code{getgroups} system call.
(Multiple groups may not be supported on all systems.)
@end table

@noindent
These file names may also be used on the command line to name data files.
These file names are only recognized internally if you do not
actually have files with these names on your system.

@xref{Special Files, ,Special File Names in @code{gawk}}, for a longer description that
provides the motivation for this feature.
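
A common use of these names is to keep diagnostics separate from normal
output. The sketch below assumes, purely for illustration, that valid
input records have at least three fields; the wording of the message is
likewise invented:

@example
# Complain on standard error about short records, then skip them.
NF < 3 @{
    print "record " NR ": too few fields" > "/dev/stderr"
    next
@}

# Normal output still goes to standard output.
@{ print $1, $3 @}
@end example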

@node Built-in Functions Summary, Time Functions Summary, Special File Summary, Actions Summary
@appendixsubsec Built-in Functions

@code{awk} provides a number of built-in functions for performing
numeric operations, string-related operations, and I/O-related operations.

@c NEEDED
@page
The built-in arithmetic functions are:

@table @code
@item atan2(@var{y}, @var{x})
the arctangent of @var{y/x} in radians.

@item cos(@var{expr})
the cosine of @var{expr}, which is in radians.

@item exp(@var{expr})
the exponential function (@code{e ^ @var{expr}}).

@item int(@var{expr})
truncates to integer.

@item log(@var{expr})
the natural logarithm of @var{expr}.

@item rand()
a random number between zero and one.

@item sin(@var{expr})
the sine of @var{expr}, which is in radians.

@item sqrt(@var{expr})
the square root function.

@item srand(@r{[}@var{expr}@r{]})
use @var{expr} as a new seed for the random number generator. If no @var{expr}
is provided, the time of day is used. The return value is the previous
seed for the random number generator.
@end table

@code{awk} has the following built-in string functions:

@table @code
@item gensub(@var{regex}, @var{subst}, @var{how} @r{[}, @var{target}@r{]})
If @var{how} is a string beginning with @samp{g} or @samp{G}, then
replace each match of @var{regex} in @var{target} with @var{subst}.
Otherwise, replace the @var{how}'th occurrence. If @var{target} is not
supplied, use @code{$0}. The return value is the changed string; the
original @var{target} is not modified. Within @var{subst},
@samp{\@var{n}}, where @var{n} is a digit from one to nine, can be used to
indicate the text that matched the @var{n}'th parenthesized
subexpression.
18388This function is @code{gawk}-specific. 18389 18390@item gsub(@var{regex}, @var{subst} @r{[}, @var{target}@r{]}) 18391for each substring matching the regular expression @var{regex} in the string 18392@var{target}, substitute the string @var{subst}, and return the number of 18393substitutions. If @var{target} is not supplied, use @code{$0}. 18394 18395@item index(@var{str}, @var{search}) 18396returns the index of the string @var{search} in the string @var{str}, or 18397zero if 18398@var{search} is not present. 18399 18400@item length(@r{[}@var{str}@r{]}) 18401returns the length of the string @var{str}. The length of @code{$0} 18402is returned if no argument is supplied. 18403 18404@item match(@var{str}, @var{regex}) 18405returns the position in @var{str} where the regular expression @var{regex} 18406occurs, or zero if @var{regex} is not present, and sets the values of 18407@code{RSTART} and @code{RLENGTH}. 18408 18409@item split(@var{str}, @var{arr} @r{[}, @var{regex}@r{]}) 18410splits the string @var{str} into the array @var{arr} on the regular expression 18411@var{regex}, and returns the number of elements. If @var{regex} is omitted, 18412@code{FS} is used instead. @var{regex} can be the null string, causing 18413each character to be placed into its own array element. 18414The array @var{arr} is cleared first. 18415 18416@item sprintf(@var{fmt}, @var{expr-list}) 18417prints @var{expr-list} according to @var{fmt}, and returns the resulting string. 18418 18419@item sub(@var{regex}, @var{subst} @r{[}, @var{target}@r{]}) 18420just like @code{gsub}, but only the first matching substring is replaced. 18421 18422@item substr(@var{str}, @var{index} @r{[}, @var{len}@r{]}) 18423returns the @var{len}-character substring of @var{str} starting at @var{index}. 18424If @var{len} is omitted, the rest of @var{str} is used. 
18425 18426@item tolower(@var{str}) 18427returns a copy of the string @var{str}, with all the upper-case characters in 18428@var{str} translated to their corresponding lower-case counterparts. 18429Non-alphabetic characters are left unchanged. 18430 18431@item toupper(@var{str}) 18432returns a copy of the string @var{str}, with all the lower-case characters in 18433@var{str} translated to their corresponding upper-case counterparts. 18434Non-alphabetic characters are left unchanged. 18435@end table 18436 18437The I/O related functions are: 18438 18439@table @code 18440@item close(@var{expr}) 18441Close the open file or pipe denoted by @var{expr}. 18442 18443@item fflush(@r{[}@var{expr}@r{]}) 18444Flush any buffered output for the output file or pipe denoted by @var{expr}. 18445If @var{expr} is omitted, standard output is flushed. 18446If @var{expr} is the null string (@code{""}), all output buffers are flushed. 18447 18448@item system(@var{cmd-line}) 18449Execute the command @var{cmd-line}, and return the exit status. 18450If your operating system does not support @code{system}, calling it will 18451generate a fatal error. 18452 18453@samp{system("")} can be used to force @code{awk} to flush any pending 18454output. This is more portable, but less obvious, than calling @code{fflush}. 18455@end table 18456 18457@node Time Functions Summary, String Constants Summary, Built-in Functions Summary, Actions Summary 18458@appendixsubsec Time Functions 18459 18460The following two functions are available for getting the current 18461time of day, and for formatting time stamps. 18462They are specific to @code{gawk}. 18463 18464@table @code 18465@item systime() 18466returns the current time of day as the number of seconds since a particular 18467epoch (Midnight, January 1, 1970 UTC, on POSIX systems). 18468 18469@item strftime(@r{[}@var{format}@r{[}, @var{timestamp}@r{]]}) 18470formats @var{timestamp} according to the specification in @var{format}. 
The current time of day is used if no @var{timestamp} is supplied.
A default format equivalent to the output of the @code{date} utility is used if
no @var{format} is supplied.
@xref{Time Functions, ,Functions for Dealing with Time Stamps}, for the
details on the conversion specifiers that @code{strftime} accepts.
@end table

@iftex
@xref{Built-in, ,Built-in Functions}, for a description of all of
@code{awk}'s built-in functions.
@end iftex

@node String Constants Summary, , Time Functions Summary, Actions Summary
@appendixsubsec String Constants

String constants in @code{awk} are sequences of characters enclosed
in double quotes (@code{"}). Within strings, certain @dfn{escape sequences}
are recognized, as in C. These are:

@table @code
@item \\
A literal backslash.

@item \a
The ``alert'' character; usually the ASCII BEL character.

@item \b
Backspace.

@item \f
Formfeed.

@item \n
Newline.

@item \r
Carriage return.

@item \t
Horizontal tab.

@item \v
Vertical tab.

@item \x@var{hex digits}
The character represented by the string of hexadecimal digits following
the @samp{\x}. As in ANSI C, all following hexadecimal digits are
considered part of the escape sequence. E.g., @code{"\x1B"} is a
string containing the ASCII ESC (escape) character. (The @samp{\x}
escape sequence is not in POSIX @code{awk}.)

@item \@var{ddd}
The character represented by the one, two, or three digit sequence of octal
digits. Thus, @code{"\033"} is also a string containing the ASCII ESC
(escape) character.

@item \@var{c}
The literal character @var{c}, if @var{c} is not one of the above.
@end table

The escape sequences may also be used inside constant regular expressions
(e.g., the regexp @code{@w{/[@ \t\f\n\r\v]/}} matches whitespace
characters).

@xref{Escape Sequences}.

@node Functions Summary, Historical Features, Actions Summary, Gawk Summary
@appendixsec User-defined Functions

Functions in @code{awk} are defined as follows:

@example
function @var{name}(@var{parameter list}) @{ @var{statements} @}
@end example

Actual parameters supplied in the function call are used to instantiate
the formal parameters declared in the function. Arrays are passed by
reference; other variables are passed by value.

If there are fewer arguments passed than there are names in @var{parameter list},
the extra names are given the null string as their value. Extra names have the
effect of local variables.

The open-parenthesis in a function call of a user-defined function must
immediately follow the function name, without any intervening white space.
This is to avoid a syntactic ambiguity with the concatenation operator.

The word @code{func} may be used in place of @code{function} (but not in
POSIX @code{awk}).

Use the @code{return} statement to return a value from a function.

@xref{User-defined, ,User-defined Functions}.

@node Historical Features, , Functions Summary, Gawk Summary
@appendixsec Historical Features

@cindex historical features
There are two features of historical @code{awk} implementations that
@code{gawk} supports.

First, it is possible to call the @code{length} built-in function not only
with no arguments, but even without parentheses!

@example
a = length
@end example

@noindent
is the same as either of

@example
a = length()
a = length($0)
@end example

@noindent
For example:

@example
$ echo abcdef | awk '@{ print length @}'
@print{} 6
@end example

@noindent
This feature is marked as ``deprecated'' in the POSIX standard, and
@code{gawk} will issue a warning about its use if @samp{--lint} is
specified on the command line.
(The ability to use @code{length} this way was actually an accident of the
original Unix @code{awk} implementation. If any built-in function used
@code{$0} as its default argument, it was possible to call that function
without the parentheses. In particular, it was common practice to use
the @code{length} function in this fashion, and this usage was documented
in the @code{awk} manual page.)

The other historical feature is the use of either the @code{break} statement,
or the @code{continue} statement
outside the body of a @code{while}, @code{for}, or @code{do} loop. Traditional
@code{awk} implementations have treated such usage as equivalent to the
@code{next} statement. More recent versions of Unix @code{awk} do not allow
it. @code{gawk} supports this usage if @samp{--traditional} has been
specified.

@xref{Options, ,Command Line Options}, for more information about the
@samp{--lint} and @samp{--traditional} options.

@node Installation, Notes, Gawk Summary, Top
@appendix Installing @code{gawk}

This appendix provides instructions for installing @code{gawk} on the
various platforms that are supported by the developers. The primary
developers support Unix (and one day, GNU), while the other ports were
contributed.
The file @file{ACKNOWLEDGMENT} in the @code{gawk}
distribution lists the electronic mail addresses of the people who did
the respective ports, and they are also provided in
@ref{Bugs, , Reporting Problems and Bugs}.

@menu
* Gawk Distribution::           What is in the @code{gawk} distribution.
* Unix Installation::           Installing @code{gawk} under various versions
                                of Unix.
* VMS Installation::            Installing @code{gawk} on VMS.
* PC Installation::             Installing and Compiling @code{gawk} on MS-DOS
                                and OS/2.
* Atari Installation::          Installing @code{gawk} on the Atari ST.
* Amiga Installation::          Installing @code{gawk} on an Amiga.
* Bugs::                        Reporting Problems and Bugs.
* Other Versions::              Other freely available @code{awk}
                                implementations.
@end menu

@node Gawk Distribution, Unix Installation, Installation, Installation
@appendixsec The @code{gawk} Distribution

This section first describes how to get the @code{gawk}
distribution, how to extract it, and then what is in the various files and
subdirectories.

@menu
* Getting::                     How to get the distribution.
* Extracting::                  How to extract the distribution.
* Distribution contents::       What is in the distribution.
@end menu

@node Getting, Extracting, Gawk Distribution, Gawk Distribution
@appendixsubsec Getting the @code{gawk} Distribution
@cindex getting @code{gawk}
@cindex anonymous @code{ftp}
@cindex @code{ftp}, anonymous
@cindex Free Software Foundation
There are three ways you can get GNU software.

@enumerate
@item
You can copy it from someone else who already has it.

@cindex Free Software Foundation
@item
You can order @code{gawk} directly from the Free Software Foundation.
Software distributions are available for Unix, MS-DOS, and VMS, on
tape and CD-ROM.
The address is:

@quotation
Free Software Foundation @*
59 Temple Place---Suite 330 @*
Boston, MA 02111-1307 USA @*
Phone: +1-617-542-5942 @*
Fax (including Japan): +1-617-542-2652 @*
Email: @code{gnu@@gnu.org} @*
URL: @code{http://www.gnu.org/} @*
@end quotation

@noindent
Ordering from the FSF directly contributes to the support of the foundation
and to the production of more free software.

@item
You can get @code{gawk} by using anonymous @code{ftp} to the Internet host
@code{gnudist.gnu.org}, in the directory @file{/gnu/gawk}.

Here is a list of alternate @code{ftp} sites from which you can obtain GNU
software. When a site is listed as ``@var{site}@code{:}@var{directory}'' the
@var{directory} indicates the directory where GNU software is kept.
You should use a site that is geographically close to you.

@table @asis
@item Asia:
@table @code
@item cair-archive.kaist.ac.kr:/pub/gnu
@itemx ftp.cs.titech.ac.jp
@itemx ftp.nectec.or.th:/pub/mirrors/gnu
@itemx utsun.s.u-tokyo.ac.jp:/ftpsync/prep
@end table

@c NEEDED
@page
@item Australia:
@table @code
@item archie.au:/gnu
(@code{archie.oz} or @code{archie.oz.au} for ACSnet)
@end table

@item Africa:
@table @code
@item ftp.sun.ac.za:/pub/gnu
@end table

@item Middle East:
@table @code
@item ftp.technion.ac.il:/pub/unsupported/gnu
@end table

@item Europe:
@table @code
@item archive.eu.net
@itemx ftp.denet.dk
@itemx ftp.eunet.ch
@itemx ftp.funet.fi:/pub/gnu
@itemx ftp.ieunet.ie:pub/gnu
@itemx ftp.informatik.rwth-aachen.de:/pub/gnu
@itemx ftp.informatik.tu-muenchen.de
@itemx ftp.luth.se:/pub/unix/gnu
@itemx ftp.mcc.ac.uk
@itemx ftp.stacken.kth.se
@itemx ftp.sunet.se:/pub/gnu
@itemx ftp.univ-lyon1.fr:pub/gnu
@itemx ftp.win.tue.nl:/pub/gnu
@itemx irisa.irisa.fr:/pub/gnu
@itemx isy.liu.se
@itemx nic.switch.ch:/mirror/gnu
@itemx src.doc.ic.ac.uk:/gnu
@itemx unix.hensa.ac.uk:/pub/uunet/systems/gnu
@end table

@item South America:
@table @code
@item ftp.inf.utfsm.cl:/pub/gnu
@itemx ftp.unicamp.br:/pub/gnu
@end table

@item Western Canada:
@table @code
@item ftp.cs.ubc.ca:/mirror2/gnu
@end table

@item USA:
@table @code
@item col.hp.com:/mirrors/gnu
@itemx f.ms.uky.edu:/pub3/gnu
@itemx ftp.cc.gatech.edu:/pub/gnu
@itemx ftp.cs.columbia.edu:/archives/gnu/prep
@itemx ftp.digex.net:/pub/gnu
@itemx ftp.hawaii.edu:/mirrors/gnu
@itemx ftp.kpc.com:/pub/mirror/gnu
@end table

@c NEEDED
@page
@item USA (continued):
@table @code
@item ftp.uu.net:/systems/gnu
@itemx gatekeeper.dec.com:/pub/GNU
@itemx jaguar.utah.edu:/gnustuff
@itemx labrea.stanford.edu
@itemx mrcnext.cso.uiuc.edu:/pub/gnu
@itemx vixen.cso.uiuc.edu:/gnu
@itemx wuarchive.wustl.edu:/systems/gnu
@end table
@end table
@end enumerate

@node Extracting, Distribution contents, Getting, Gawk Distribution
@appendixsubsec Extracting the Distribution
@code{gawk} is distributed as a @code{tar} file compressed with the
GNU Zip program, @code{gzip}.

Once you have the distribution (for example,
@file{gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz}), first use @code{gzip} to expand the
file, and then use @code{tar} to extract it.
You can use the following
pipeline to produce the @code{gawk} distribution:

@example
# Under System V, add 'o' to the tar flags
gzip -d -c gawk-@value{VERSION}.@value{PATCHLEVEL}.tar.gz | tar -xvpf -
@end example

@noindent
This will create a directory named @file{gawk-@value{VERSION}.@value{PATCHLEVEL}} in the current
directory.

The distribution file name is of the form
@file{gawk-@var{V}.@var{R}.@var{n}.tar.gz}.
The @var{V} represents the major version of @code{gawk},
the @var{R} represents the current release of version @var{V}, and
the @var{n} represents a @dfn{patch level}, meaning that minor bugs have
been fixed in the release. The current patch level is @value{PATCHLEVEL},
but when
retrieving distributions, you should get the version with the highest
version, release, and patch level. (Note that release levels greater than
or equal to 90 denote ``beta,'' or non-production software; you may not wish
to retrieve such a version unless you don't mind experimenting.)

If you are not on a Unix system, you will need to make other arrangements
for getting and extracting the @code{gawk} distribution. You should consult
a local expert.

@node Distribution contents, , Extracting, Gawk Distribution
@appendixsubsec Contents of the @code{gawk} Distribution

The @code{gawk} distribution has a number of C source files,
documentation files,
subdirectories and files related to the configuration process
(@pxref{Unix Installation, ,Compiling and Installing @code{gawk} on Unix}),
and several subdirectories related to different, non-Unix,
operating systems.

@table @asis
@item various @samp{.c}, @samp{.y}, and @samp{.h} files
These files are the actual @code{gawk} source code.
@end table

@table @file
@item README
@itemx README_d/README.*
Descriptive files: @file{README} for @code{gawk} under Unix, and the
rest for the various hardware and software combinations.

@item INSTALL
A file providing an overview of the configuration and installation process.

@item PORTS
A list of systems to which @code{gawk} has been ported, and which
have successfully run the test suite.

@item ACKNOWLEDGMENT
A list of the people who contributed major parts of the code or documentation.

@item ChangeLog
A detailed list of source code changes as bugs are fixed or improvements made.

@item NEWS
A list of changes to @code{gawk} since the last release or patch.

@item COPYING
The GNU General Public License.

@item FUTURES
A brief list of features and/or changes being contemplated for future
releases, with some indication of the time frame for the feature, based
on its difficulty.

@item LIMITATIONS
A list of those factors that limit @code{gawk}'s performance.
Most of these depend on the hardware or operating system software, and
are not limits in @code{gawk} itself.

@item POSIX.STD
A description of one area where the POSIX standard for @code{awk} is
incorrect, and how @code{gawk} handles the problem.

@item PROBLEMS
A file describing known problems with the current release.

@cindex artificial intelligence, using @code{gawk}
@cindex AI programming, using @code{gawk}
@item doc/awkforai.txt
A short article describing why @code{gawk} is a good language for
AI (Artificial Intelligence) programming.

@item doc/README.card
@itemx doc/ad.block
@itemx doc/awkcard.in
@itemx doc/cardfonts
@itemx doc/colors
@itemx doc/macros
@itemx doc/no.colors
@itemx doc/setter.outline
The @code{troff} source for a five-color @code{awk} reference card.
A modern version of @code{troff}, such as GNU Troff (@code{groff}) is
needed to produce the color version. See the file @file{README.card}
for instructions if you have an older @code{troff}.

@item doc/gawk.1
The @code{troff} source for a manual page describing @code{gawk}.
This is distributed for the convenience of Unix users.

@item doc/gawk.texi
The Texinfo source file for this @value{DOCUMENT}.
It should be processed with @TeX{} to produce a printed document, and
with @code{makeinfo} to produce an Info file.

@item doc/gawk.info
The generated Info file for this @value{DOCUMENT}.

@item doc/igawk.1
The @code{troff} source for a manual page describing the @code{igawk}
program presented in
@ref{Igawk Program, ,An Easy Way to Use Library Functions}.

@item doc/Makefile.in
The input file used during the configuration process to generate the
actual @file{Makefile} for creating the documentation.

@item Makefile.in
@itemx acconfig.h
@itemx aclocal.m4
@itemx configh.in
@itemx configure.in
@itemx configure
@itemx custom.h
@itemx missing/*
These files and subdirectory are used when configuring @code{gawk}
for various Unix systems. They are explained in detail in
@ref{Unix Installation, ,Compiling and Installing @code{gawk} on Unix}.

@item awklib/extract.awk
@itemx awklib/Makefile.in
The @file{awklib} directory contains a copy of @file{extract.awk}
(@pxref{Extract Program, ,Extracting Programs from Texinfo Source Files}),
which can be used to extract the sample programs from the Texinfo
source file for this @value{DOCUMENT}, and a @file{Makefile.in} file, which
@code{configure} uses to generate a @file{Makefile}.
As part of the process of building @code{gawk}, the library functions from
@ref{Library Functions, , A Library of @code{awk} Functions},
and the @code{igawk} program from
@ref{Igawk Program, , An Easy Way to Use Library Functions},
are extracted into ready-to-use files.
They are installed as part of the installation process.

@item atari/*
Files needed for building @code{gawk} on an Atari ST.
@xref{Atari Installation, ,Installing @code{gawk} on the Atari ST}, for details.

@item pc/*
Files needed for building @code{gawk} under MS-DOS and OS/2.
@xref{PC Installation, ,MS-DOS and OS/2 Installation and Compilation}, for details.

@item vms/*
Files needed for building @code{gawk} under VMS.
@xref{VMS Installation, ,How to Compile and Install @code{gawk} on VMS}, for details.

@item test/*
A test suite for
@code{gawk}. You can use @samp{make check} from the top level @code{gawk}
directory to run your version of @code{gawk} against the test suite.
If @code{gawk} successfully passes @samp{make check} then you can
be confident of a successful port.
@end table

@node Unix Installation, VMS Installation, Gawk Distribution, Installation
@appendixsec Compiling and Installing @code{gawk} on Unix

Usually, you can compile and install @code{gawk} by typing only two
commands. However, if you use an unusual system, you may need
to configure @code{gawk} for your system yourself.

@menu
* Quick Installation::          Compiling @code{gawk} under Unix.
* Configuration Philosophy::    How it's all supposed to work.
@end menu

@node Quick Installation, Configuration Philosophy, Unix Installation, Unix Installation
@appendixsubsec Compiling @code{gawk} for Unix

@cindex installation, unix
After you have extracted the @code{gawk} distribution, @code{cd}
to @file{gawk-@value{VERSION}.@value{PATCHLEVEL}}. Like most GNU software,
@code{gawk} is configured
automatically for your Unix system by running the @code{configure} program.
This program is a Bourne shell script that was generated automatically using
GNU @code{autoconf}.
@iftex
(The @code{autoconf} software is
described fully in
@cite{Autoconf---Generating Automatic Configuration Scripts},
which is available from the Free Software Foundation.)
@end iftex
@ifinfo
(The @code{autoconf} software is described fully starting with
@ref{Top, , Introduction, autoconf, Autoconf---Generating Automatic Configuration Scripts}.)
@end ifinfo

To configure @code{gawk}, simply run @code{configure}:

@example
sh ./configure
@end example

This produces a @file{Makefile} and @file{config.h} tailored to your system.
The @file{config.h} file describes various facts about your system.
You may wish to edit the @file{Makefile} to
change the @code{CFLAGS} variable, which controls
the command line options that are passed to the C compiler (such as
optimization levels, or compiling for debugging).
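
The edit to @code{CFLAGS} can also be scripted rather than made by hand.
The following is only a sketch: the helper name @code{set_cflags} is
hypothetical and is not part of the @code{gawk} distribution, and it
assumes that the generated @file{Makefile} assigns @code{CFLAGS} on a
single line that begins with @samp{CFLAGS}.

```shell
# set_cflags MAKEFILE FLAGS
# Hypothetical helper (not shipped with gawk): print MAKEFILE on standard
# output with the value of its CFLAGS assignment replaced by FLAGS.
# Assumes CFLAGS is assigned on one line starting with "CFLAGS", and that
# FLAGS contains no '|', '&', or backslash characters, because FLAGS is
# pasted directly into the sed replacement text.
set_cflags() {
    sed "s|^CFLAGS[[:space:]]*=.*|CFLAGS = $2|" "$1"
}
```

@noindent
For instance, @samp{set_cflags Makefile -g > Makefile.new} would write a
copy of the @file{Makefile} that compiles for debugging; the original file
is left untouched until you choose to replace it.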

Alternatively, you can add your own values for most @code{make}
variables, such as @code{CC} and @code{CFLAGS}, on the command line when
running @code{configure}:

@example
CC=cc CFLAGS=-g sh ./configure
@end example

@noindent
See the file @file{INSTALL} in the @code{gawk} distribution for
all the details.

After you have run @code{configure}, and possibly edited the @file{Makefile},
type:

@example
make
@end example

@noindent
and shortly thereafter, you should have an executable version of @code{gawk}.
That's all there is to it!
(If these steps do not work, please send in a bug report;
@pxref{Bugs, ,Reporting Problems and Bugs}.)

@node Configuration Philosophy, , Quick Installation, Unix Installation
@appendixsubsec The Configuration Process

@cindex configuring @code{gawk}
(This section is of interest only if you know something about using the
C language and the Unix operating system.)

The source code for @code{gawk} generally attempts to adhere to formal
standards wherever possible. This means that @code{gawk} uses library
routines that are specified by the ANSI C standard and by the POSIX
operating system interface standard. When using an ANSI C compiler,
function prototypes are used to help improve the compile-time checking.

Many Unix systems do not support all of either the ANSI or the
POSIX standards. The @file{missing} subdirectory in the @code{gawk}
distribution contains replacement versions of those subroutines that are
most likely to be missing.

The @file{config.h} file that is created by the @code{configure} program
contains definitions that describe features of the particular operating
system where you are attempting to compile @code{gawk}.
The three things
described by this file are what header files are available, so that
they can be correctly included,
what (supposedly) standard functions are actually available in your C
libraries, and
other miscellaneous facts about your
variant of Unix. For example, there may not be an @code{st_blksize}
element in the @code{stat} structure. In this case @samp{HAVE_ST_BLKSIZE}
would be undefined.

@cindex @code{custom.h} configuration file
It is possible for your C compiler to lie to @code{configure}. It may
do so by not exiting with an error when a library function is not
available. To get around this, you can edit the file @file{custom.h}.
Use an @samp{#ifdef} that is appropriate for your system, and either
@code{#define} any constants that @code{configure} should have defined but
didn't, or @code{#undef} any constants that @code{configure} defined and
should not have. @file{custom.h} is automatically included by
@file{config.h}.

It is also possible that the @code{configure} program generated by
@code{autoconf}
will not work on your system in some other fashion. If you do have a problem,
the file
@file{configure.in} is the input for @code{autoconf}. You may be able to
change this file, and generate a new version of @code{configure} that will
work on your system. @xref{Bugs, ,Reporting Problems and Bugs}, for
information on how to report problems in configuring @code{gawk}. The same
mechanism may be used to send in updates to @file{configure.in} and/or
@file{custom.h}.

@node VMS Installation, PC Installation, Unix Installation, Installation
@appendixsec How to Compile and Install @code{gawk} on VMS

@c based on material from Pat Rankin <rankin@eql.caltech.edu>

@cindex installation, vms
This section describes how to compile and install @code{gawk} under VMS.

@menu
* VMS Compilation::             How to compile @code{gawk} under VMS.
* VMS Installation Details::    How to install @code{gawk} under VMS.
* VMS Running::                 How to run @code{gawk} under VMS.
* VMS POSIX::                   Alternate instructions for VMS POSIX.
@end menu

@node VMS Compilation, VMS Installation Details, VMS Installation, VMS Installation
@appendixsubsec Compiling @code{gawk} on VMS

To compile @code{gawk} under VMS, there is a @code{DCL} command procedure that
will issue all the necessary @code{CC} and @code{LINK} commands, and there is
also a @file{Makefile} for use with the @code{MMS} utility. From the source
directory, use either

@example
$ @@[.VMS]VMSBUILD.COM
@end example

@noindent
or

@example
$ MMS/DESCRIPTION=[.VMS]DESCRIP.MMS GAWK
@end example

Depending upon which C compiler you are using, follow one of the sets
of instructions in this table:

@table @asis
@item VAX C V3.x
Use either @file{vmsbuild.com} or @file{descrip.mms} as is. These use
@code{CC/OPTIMIZE=NOLINE}, which is essential for Version 3.0.

@item VAX C V2.x
You must have Version 2.3 or 2.4; older ones won't work. Edit either
@file{vmsbuild.com} or @file{descrip.mms} according to the comments in them.
For @file{vmsbuild.com}, this just entails removing two @samp{!} delimiters.
Also edit @file{config.h} (which is a copy of file @file{[.config]vms-conf.h})
and comment out or delete the two lines @samp{#define __STDC__ 0} and
@samp{#define VAXC_BUILTINS} near the end.

@item GNU C
Edit @file{vmsbuild.com} or @file{descrip.mms}; the changes are different
from those for VAX C V2.x, but equally straightforward. No changes to
@file{config.h} should be needed.

@item DEC C
Edit @file{vmsbuild.com} or @file{descrip.mms} according to their comments.
No changes to @file{config.h} should be needed.
@end table

@code{gawk} has been tested under VAX/VMS 5.5-1 using VAX C V3.2,
GNU C 1.40 and 2.3. It should work without modifications for VMS V4.6 and up.

@node VMS Installation Details, VMS Running, VMS Compilation, VMS Installation
@appendixsubsec Installing @code{gawk} on VMS

To install @code{gawk}, all you need is a ``foreign'' command, which is
a @code{DCL} symbol whose value begins with a dollar sign. For example:

@example
$ GAWK :== $disk1:[gnubin]GAWK
@end example

@noindent
(Substitute the actual location of @code{gawk.exe} for
@samp{$disk1:[gnubin]}.) The symbol should be placed in the
@file{login.com} of any user who wishes to run @code{gawk},
so that it will be defined every time the user logs on.
Alternatively, the symbol may be placed in the system-wide
@file{sylogin.com} procedure, which will allow all users
to run @code{gawk}.

Optionally, the help entry can be loaded into a VMS help library:

@example
$ LIBRARY/HELP SYS$HELP:HELPLIB [.VMS]GAWK.HLP
@end example

@noindent
(You may want to substitute a site-specific help library rather than
the standard VMS library @samp{HELPLIB}.) After loading the help text,

@example
$ HELP GAWK
@end example

@noindent
will provide information about both the @code{gawk} implementation and the
@code{awk} programming language.

The logical name @samp{AWK_LIBRARY} can designate a default location
for @code{awk} program files. For the @samp{-f} option, if the specified
filename has no device or directory path information in it, @code{gawk}
will look in the current directory first, then in the directory specified
by the translation of @samp{AWK_LIBRARY} if the file was not found.
If after searching in both directories, the file still is not found,
then @code{gawk} appends the suffix @samp{.awk} to the filename and the
file search will be re-tried. If @samp{AWK_LIBRARY} is not defined, that
portion of the file search will fail benignly.

@node VMS Running, VMS POSIX, VMS Installation Details, VMS Installation
@appendixsubsec Running @code{gawk} on VMS

Command line parsing and quoting conventions are significantly different
on VMS, so examples in this @value{DOCUMENT} or from other sources often need minor
changes. They @emph{are} minor though, and all @code{awk} programs
should run correctly.

Here are a couple of trivial tests:

@example
$ gawk -- "BEGIN @{print ""Hello, World!""@}"
$ gawk -"W" version
! could also be -"W version" or "-W version"
@end example

@noindent
Note that upper-case and mixed-case text must be quoted.

The VMS port of @code{gawk} includes a @code{DCL}-style interface in addition
to the original shell-style interface (see the help entry for details).
One side-effect of dual command line parsing is that if there is only a
single parameter (as in the quoted string program above), the command
becomes ambiguous. To work around this, the normally optional @samp{--}
flag is required to force Unix style rather than @code{DCL} parsing. If any
other dash-type options (or multiple parameters such as data files to be
processed) are present, there is no ambiguity and @samp{--} can be omitted.

The default search path when looking for @code{awk} program files specified
by the @samp{-f} option is @code{"SYS$DISK:[],AWK_LIBRARY:"}. The logical
name @samp{AWKPATH} can be used to override this default. The format
of @samp{AWKPATH} is a comma-separated list of directory specifications.
When defining it, the value should be quoted so that it retains a single
translation, and not a multi-translation @code{RMS} searchlist.

@node VMS POSIX, , VMS Running, VMS Installation
@appendixsubsec Building and Using @code{gawk} on VMS POSIX

Ignore the instructions above, although @file{vms/gawk.hlp} should still
be made available in a help library. The source tree should be unpacked
into a container file subsystem rather than into the ordinary VMS file
system. Make sure that the two scripts, @file{configure} and
@file{vms/posix-cc.sh}, are executable; use @samp{chmod +x} on them if
necessary. Then execute the following two commands:

@example
@group
psx> CC=vms/posix-cc.sh configure
psx> make CC=c89 gawk
@end group
@end example

@noindent
The first command will construct files @file{config.h} and @file{Makefile} out
of templates, using a script to make the C compiler fit @code{configure}'s
expectations. The second command will compile and link @code{gawk} using
the C compiler directly; ignore any warnings from @code{make} about being
unable to redefine @code{CC}. @code{configure} will take a very long
time to execute, but at least it provides incremental feedback as it
runs.

This has been tested with VAX/VMS V6.2, VMS POSIX V2.0, and DEC C V5.2.

Once built, @code{gawk} will work like any other shell utility. Unlike
the normal VMS port of @code{gawk}, no special command line manipulation is
needed in the VMS POSIX environment.
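
Because the result behaves like an ordinary shell utility, a freshly built
binary can be checked with a one-line program. The following is a minimal
sketch for a POSIX-style shell; the function name @code{gawk_smoke} and its
optional argument are assumptions made for illustration, not something
supplied with @code{gawk}.

```shell
# gawk_smoke [AWK]
# Hypothetical helper: run a one-line program with the given awk binary
# (default: gawk) and report whether it printed the expected length of
# the six-character input record.
gawk_smoke() {
    result=$(echo abcdef | "${1:-gawk}" '{ print length($0) }')
    if [ "$result" = 6 ]; then
        echo ok
    else
        echo failed
        return 1
    fi
}
```

@noindent
For example, @samp{gawk_smoke ./gawk} would check the binary in the current
directory, assuming that is where the build left it.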

@c Rewritten by Scott Deifik <scottd@amgen.com>
@c and Darrel Hankerson <hankedr@mail.auburn.edu>
@node PC Installation, Atari Installation, VMS Installation, Installation
@appendixsec MS-DOS and OS/2 Installation and Compilation

@cindex installation, MS-DOS and OS/2
If you have received a binary distribution prepared by the DOS
maintainers, then @code{gawk} and the necessary support files will appear
under the @file{gnu} directory, with executables in @file{gnu/bin},
libraries in @file{gnu/lib/awk}, and manual pages under @file{gnu/man}.
This is designed for easy installation to a @file{/gnu} directory on your
drive, but the files can be installed anywhere provided @code{AWKPATH} is
set properly. Regardless of the installation directory, the first line of
@file{igawk.cmd} and @file{igawk.bat} (in @file{gnu/bin}) may need to be
edited.

The binary distribution will contain a separate file describing the
contents. In particular, it may include more than one version of the
@code{gawk} executable. OS/2 binary distributions may have a
different arrangement, but installation is similar.

The OS/2 and MS-DOS versions of @code{gawk} search for program files as
described in @ref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.
However, semicolons (rather than colons) separate elements
in the @code{AWKPATH} variable. If @code{AWKPATH} is not set or is empty,
then the default search path is @code{@w{".;c:/lib/awk;c:/gnu/lib/awk"}}.

An @code{sh}-like shell (as opposed to @code{command.com} under MS-DOS
or @code{cmd.exe} under OS/2) may be useful for @code{awk} programming.
Ian Stewartson has written an excellent shell for MS-DOS and OS/2, and a
@code{ksh} clone and GNU Bash are available for OS/2.
The file
@file{README_d/README.pc} in the @code{gawk} distribution contains
information on these shells. Users of Stewartson's shell on DOS should
examine its documentation on the handling of command lines. In particular,
the setting for @code{gawk} in the shell configuration may need to be
changed, and the @code{ignoretype} option may also be of interest.

@code{gawk} can be compiled for MS-DOS and OS/2 using the GNU development tools
from DJ Delorie (DJGPP, MS-DOS-only) or Eberhard Mattes (EMX, MS-DOS and OS/2).
Microsoft C can be used to build 16-bit versions for MS-DOS and OS/2. The file
@file{README_d/README.pc} in the @code{gawk} distribution contains additional
notes, and @file{pc/Makefile} contains important notes on compilation options.

To build @code{gawk}, copy the files in the @file{pc} directory (@emph{except}
for @file{ChangeLog}) to the
directory with the rest of the @code{gawk} sources. The @file{Makefile}
contains a configuration section with comments, and may need to be
edited in order to work with your @code{make} utility.

The @file{Makefile} contains a number of targets for building various MS-DOS
and OS/2 versions. A list of targets will be printed if the @code{make}
command is given without a target. As an example, to build @code{gawk}
using the DJGPP tools, enter @samp{make djgpp}.

Using @code{make} to run the standard tests and to install @code{gawk}
requires additional Unix-like tools, including @code{sh}, @code{sed}, and
@code{cp}. In order to run the tests, the @file{test/*.ok} files may need to
be converted so that they have the usual DOS-style end-of-line markers. Most
of the tests will work properly with Stewartson's shell along with the
companion utilities or appropriate GNU utilities. However, some editing of
@file{test/Makefile} is required.
It is recommended that the file
@file{pc/Makefile.tst} be copied to @file{test/Makefile} as a
replacement. Details can be found in @file{README_d/README.pc}.

@node Atari Installation, Amiga Installation, PC Installation, Installation
@appendixsec Installing @code{gawk} on the Atari ST

@c based on material from Michal Jaegermann <michal@gortel.phys.ualberta.ca>

@cindex atari
@cindex installation, atari
There are no substantial differences when installing @code{gawk} on
various Atari models. Compiled @code{gawk} executables do not require
a large amount of memory with most @code{awk} programs and should run on all
Motorola processor based models (referred to below as ST, even if that is not
exactly right).

In order to use @code{gawk}, you need to have a shell, either text or
graphics, that does not map all the characters of a command line to
upper-case. Maintaining case distinction in option flags is very
important (@pxref{Options, ,Command Line Options}).
These days this is the default, and it may only be a problem for some
very old machines. If your system does not preserve the case of option
flags, you will need to upgrade your tools. Support for I/O
redirection is necessary to make it easy to import @code{awk} programs
from other environments. Pipes are nice to have, but not vital.

@menu
* Atari Compiling::             Compiling @code{gawk} on Atari
* Atari Using::                 Running @code{gawk} on Atari
@end menu

@node Atari Compiling, Atari Using, Atari Installation, Atari Installation
@appendixsubsec Compiling @code{gawk} on the Atari ST

A proper compilation of @code{gawk} sources when @code{sizeof(int)}
differs from @code{sizeof(void *)} requires an ANSI C compiler. An initial
port was done with @code{gcc}.
You may actually prefer executables
where @code{int}s are four bytes wide, but the other variant works as well.

You may need quite a bit of memory when trying to recompile the @code{gawk}
sources, as some source files (@file{regex.c} in particular) are quite
big. If you run out of memory compiling such a file, try reducing the
optimization level for this particular file; this may help.

@cindex Linux
With a reasonable shell (Bash will do), and in particular if you run
Linux, MiNT or a similar operating system, you have a pretty good
chance that the @code{configure} utility will succeed. Otherwise
sample versions of @file{config.h} and @file{Makefile.st} are given in the
@file{atari} subdirectory and can be edited and copied to the
corresponding files in the main source directory. Even if
@code{configure} produced something, it might be advisable to compare
its results with the sample versions and possibly make adjustments.

Some @code{gawk} source code fragments depend on a preprocessor define
@samp{atarist}. This basically assumes the TOS environment with @code{gcc}.
Modify these sections as appropriate if they are not right for your
environment. Also see the remarks about @code{AWKPATH} and @code{envsep} in
@ref{Atari Using, ,Running @code{gawk} on the Atari ST}.

As shipped, the sample @file{config.h} claims that the @code{system}
function is missing from the libraries, which is not true, and an
alternative implementation of this function is provided in
@file{atari/system.c}. Depending upon your particular combination of
shell and operating system, you may wish to change the file to indicate
that @code{system} is available.
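
For illustration only, such a change might amount to a single line in
@file{config.h}. The macro name below is an assumption, based on the
usual Autoconf convention of @samp{HAVE_@var{function}}; check the sample
@file{config.h} in the @file{atari} subdirectory for the spelling it
actually uses.

@example
/* assumed Autoconf-style name; verify against the sample config.h */
#define HAVE_SYSTEM 1
@end example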

@node Atari Using, , Atari Compiling, Atari Installation
@appendixsubsec Running @code{gawk} on the Atari ST

An executable version of @code{gawk} should be placed, as usual,
anywhere in your @code{PATH} where your shell can find it.

While executing, @code{gawk} creates a number of temporary files. When
using @code{gcc} libraries for TOS, @code{gawk} looks for either of
the environment variables @code{TEMP} or @code{TMPDIR}, in that order.
If either one is found, its value is assumed to be a directory for
temporary files. This directory must exist, and if you can spare the
memory, it is a good idea to put it on a RAM drive. If neither
@code{TEMP} nor @code{TMPDIR} is found, then @code{gawk} uses the
current directory for its temporary files.

The ST version of @code{gawk} searches for its program files as described in
@ref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}.
The default value for the @code{AWKPATH} variable is taken from
@code{DEFPATH} defined in @file{Makefile}. The sample @code{gcc}/TOS
@file{Makefile} for the ST in the distribution sets @code{DEFPATH} to
@code{@w{".,c:\lib\awk,c:\gnu\lib\awk"}}. The search path can be
modified by explicitly setting @code{AWKPATH} to whatever you wish.
Note that colons cannot be used on the ST to separate elements in the
@code{AWKPATH} variable, since they have another, reserved, meaning.
Instead, you must use a comma to separate elements in the path. When
recompiling, the separating character can be modified by initializing
the @code{envsep} variable in @file{atari/gawkmisc.atr} to another
value.

Although @code{awk} allows great flexibility in doing I/O redirections
from within a program, this facility should be used with care on the ST
running under TOS.
In some circumstances the OS routines for file
handle pool processing lose track of certain events, causing the
computer to crash, and requiring a reboot. Often a warm reboot is
sufficient. Fortunately, this happens infrequently, and in rather
esoteric situations. In particular, avoid having one part of an
@code{awk} program using @code{print} statements explicitly redirected
to @code{"/dev/stdout"}, while other @code{print} statements use the
default standard output, and a calling shell has redirected standard
output to a file.

When @code{gawk} is compiled with the ST version of @code{gcc} and its
usual libraries, it will accept both @samp{/} and @samp{\} as path separators.
While this is convenient, it should be remembered that this removes one,
technically valid, character (@samp{/}) from your file names, and that
it may create problems for external programs, called via the @code{system}
function, which may not support this convention. Whenever it is possible
that a file created by @code{gawk} will be used by some other program,
use only backslashes. Also remember that in @code{awk}, backslashes in
strings have to be doubled in order to get literal backslashes
(@pxref{Escape Sequences}).

@node Amiga Installation, Bugs, Atari Installation, Installation
@appendixsec Installing @code{gawk} on an Amiga

@cindex amiga
@cindex installation, amiga
You can install @code{gawk} on an Amiga system using a Unix emulation
environment available via anonymous @code{ftp} from
@code{ftp.ninemoons.com} in the directory @file{pub/ade/current}.
This includes a shell based on @code{pdksh}. The primary component of
this environment is a Unix emulation library, @file{ixemul.lib}.
@c could really use more background here, who wrote this, etc.

A more complete distribution for the Amiga is available on
the Geek Gadgets CD-ROM from:

@quotation
CRONUS @*
1840 E. Warner Road #105-265 @*
Tempe, AZ 85284 USA @*
US Toll Free: (800) 804-0833 @*
Phone: +1-602-491-0442 @*
FAX: +1-602-491-0048 @*
Email: @code{info@@ninemoons.com} @*
WWW: @code{http://www.ninemoons.com} @*
Anonymous @code{ftp} site: @code{ftp.ninemoons.com} @*
@end quotation

Once you have the distribution, you can configure @code{gawk} simply by
running @code{configure}:

@example
configure -v m68k-amigaos
@end example

Then run @code{make}, and you should be all set!
(If these steps do not work, please send in a bug report;
@pxref{Bugs, ,Reporting Problems and Bugs}.)

@node Bugs, Other Versions, Amiga Installation, Installation
@appendixsec Reporting Problems and Bugs
@display
@i{There is nothing more dangerous than a bored archeologist.}
The Hitchhiker's Guide to the Galaxy
@c the radio show, not the book. :-)
@end display
@sp 1

If you have problems with @code{gawk} or think that you have found a bug,
please report it to the developers; we cannot promise to do anything
but we might well want to fix it.

Before reporting a bug, make sure you have actually found a real bug.
Carefully reread the documentation and see if it really says you can do
what you're trying to do. If it's not clear whether you should be able
to do something or not, report that too; it's a bug in the documentation!

Before reporting a bug or trying to fix it yourself, try to isolate it
to the smallest possible @code{awk} program and input data file that
reproduces the problem. Then send us the program and data file,
some idea of what kind of Unix system you're using, and the exact results
@code{gawk} gave you.
Also say what you expected to occur; this will help
us decide whether the problem was really in the documentation.

Once you have a precise problem, send email to @email{bug-gawk@@gnu.org}.
Using this address will automatically send a carbon copy of your
mail to Arnold Robbins. If necessary, he can be reached directly at
@email{arnold@@gnu.org}.

Please include the version number of @code{gawk} you are using.
You can get this information with the command @samp{gawk --version}.

@cindex @code{comp.lang.awk}
@strong{Important!} Do @emph{not} try to report bugs in @code{gawk} by
posting to the Usenet/Internet newsgroup @code{comp.lang.awk}.
While the @code{gawk} developers do occasionally read this newsgroup,
there is no guarantee that we will see your posting. The steps described
above are the official, recognized ways for reporting bugs.

Non-bug suggestions are always welcome as well. If you have questions
about things that are unclear in the documentation or are just obscure
features, ask Arnold Robbins; he will try to help you out, although he
may not have the time to fix the problem. You can send him electronic
mail at the Internet address above.

If you find bugs in one of the non-Unix ports of @code{gawk}, please send
an electronic mail message to the person who maintains that port. They
are listed below, and also in the @file{README} file in the @code{gawk}
distribution. Information in the @file{README} file should be considered
authoritative if it conflicts with this @value{DOCUMENT}.

@c NEEDED for looks
@page
The people maintaining the non-Unix ports of @code{gawk} are:

@cindex Deifik, Scott
@cindex Fish, Fred
@cindex Hankerson, Darrel
@cindex Jaegermann, Michal
@cindex Rankin, Pat
@cindex Rommel, Kai Uwe
@table @asis
@item MS-DOS
Scott Deifik, @samp{scottd@@amgen.com}, and
Darrel Hankerson, @samp{hankedr@@mail.auburn.edu}.

@item OS/2
Kai Uwe Rommel, @samp{rommel@@ars.de}.

@item VMS
Pat Rankin, @samp{rankin@@eql.caltech.edu}.

@item Atari ST
Michal Jaegermann, @samp{michal@@gortel.phys.ualberta.ca}.

@item Amiga
Fred Fish, @samp{fnf@@ninemoons.com}.
@end table

If your bug is also reproducible under Unix, please send copies of your
report to the general GNU bug list, as well as to Arnold Robbins, at the
addresses listed above.

@node Other Versions, , Bugs, Installation
@appendixsec Other Freely Available @code{awk} Implementations
@cindex Brennan, Michael
@ignore
From: emory!amc.com!brennan (Michael Brennan)
Subject: C++ comments in awk programs
To: arnold@gnu.ai.mit.edu (Arnold Robbins)
Date: Wed, 4 Sep 1996 08:11:48 -0700 (PDT)

@end ignore
@display
@i{It's kind of fun to put comments like this in your awk code.}
      @code{// Do C++ comments work? answer: yes! of course}
Michael Brennan
@end display
@sp 1

There are two other freely available @code{awk} implementations.
This section briefly describes where to get them.

@table @asis
@cindex Kernighan, Brian
@cindex anonymous @code{ftp}
@cindex @code{ftp}, anonymous
@item Unix @code{awk}
Brian Kernighan has been able to make his implementation of
@code{awk} freely available. You can get it via anonymous @code{ftp}
to the host @code{@w{netlib.bell-labs.com}}.
Change directory to
@file{/netlib/research}. Use ``binary'' or ``image'' mode, and
retrieve @file{awk.bundle.gz}.

This is a shell archive that has been compressed with the GNU @code{gzip}
utility. It can be uncompressed with the @code{gunzip} utility.

You can also retrieve this version via the World Wide Web from his
@uref{http://cm.bell-labs.com/who/bwk, home page}.

This version requires an ANSI C compiler; GCC (the GNU C compiler)
works quite nicely.

@cindex Brennan, Michael
@cindex @code{mawk}
@item @code{mawk}
Michael Brennan has written an independent implementation of @code{awk},
called @code{mawk}. It is available under the GPL
(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}),
just as @code{gawk} is.

You can get it via anonymous @code{ftp} to the host
@code{@w{ftp.whidbey.net}}. Change directory to @file{/pub/brennan}.
Use ``binary'' or ``image'' mode, and retrieve @file{mawk1.3.3.tar.gz}
(or the latest version that is there).

@code{gunzip} may be used to decompress this file. Installation
is similar to @code{gawk}'s
(@pxref{Unix Installation, , Compiling and Installing @code{gawk} on Unix}).
@end table

@node Notes, Glossary, Installation, Top
@appendix Implementation Notes

This appendix contains information mainly of interest to implementors and
maintainers of @code{gawk}. Everything in it applies specifically to
@code{gawk}, and not to other implementations.

@menu
* Compatibility Mode::          How to disable certain @code{gawk} extensions.
* Additions::                   Making Additions To @code{gawk}.
* Future Extensions::           New features that may be implemented one day.
* Improvements::                Suggestions for improvements by volunteers.
@end menu

@node Compatibility Mode, Additions, Notes, Notes
@appendixsec Downward Compatibility and Debugging

@xref{POSIX/GNU, ,Extensions in @code{gawk} Not in POSIX @code{awk}},
for a summary of the GNU extensions to the @code{awk} language and program.
All of these features can be turned off by invoking @code{gawk} with the
@samp{--traditional} option, or with the @samp{--posix} option.

If @code{gawk} is compiled for debugging with @samp{-DDEBUG}, then there
is one more option available on the command line:

@table @code
@item -W parsedebug
@itemx --parsedebug
Print out the parse stack information as the program is being parsed.
@end table

This option is intended only for serious @code{gawk} developers,
and not for the casual user. It probably has not even been compiled into
your version of @code{gawk}, since it slows down execution.

@node Additions, Future Extensions, Compatibility Mode, Notes
@appendixsec Making Additions to @code{gawk}

If you should find that you wish to enhance @code{gawk} in a significant
fashion, you are perfectly free to do so. That is the point of having
free software; the source code is available, and you are free to change
it as you wish (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).

This section discusses the ways you might wish to change @code{gawk},
and any considerations you should bear in mind.

@menu
* Adding Code::                 Adding code to the main body of @code{gawk}.
* New Ports::                   Porting @code{gawk} to a new operating system.
@end menu

@node Adding Code, New Ports, Additions, Additions
@appendixsubsec Adding New Features

@cindex adding new features
@cindex features, adding
You are free to add any new features you like to @code{gawk}.
However, if you want your changes to be incorporated into the @code{gawk}
distribution, there are several steps that you need to take in order to
make it possible for me to include your changes.

@enumerate 1
@item
Get the latest version.
It is much easier for me to integrate changes if they are relative to
the most recent distributed version of @code{gawk}. If your version of
@code{gawk} is very old, I may not be able to integrate them at all.
@xref{Getting, ,Getting the @code{gawk} Distribution},
for information on getting the latest version of @code{gawk}.

@item
@iftex
Follow the @cite{GNU Coding Standards}.
@end iftex
@ifinfo
See @inforef{Top, , Version, standards, GNU Coding Standards}.
@end ifinfo
This document describes how GNU software should be written. If you haven't
read it, please do so, preferably @emph{before} starting to modify @code{gawk}.
(The @cite{GNU Coding Standards} are available as part of the Autoconf
distribution, from the FSF.)

@cindex @code{gawk} coding style
@cindex coding style used in @code{gawk}
@item
Use the @code{gawk} coding style.
The C code for @code{gawk} follows the instructions in the
@cite{GNU Coding Standards}, with minor exceptions. The code is formatted
using the traditional ``K&R'' style, particularly as regards the placement
of braces and the use of tabs. In brief, the coding rules for @code{gawk}
are:

@itemize @bullet
@item
Use old style (non-prototype) function headers when defining functions.

@item
Put the name of the function at the beginning of its own line.

@item
Put the return type of the function, even if it is @code{int}, on the
line above the line with the name and arguments of the function.

@item
The declarations for the function arguments should not be indented.

@item
Put spaces around parentheses used in control structures
(@code{if}, @code{while}, @code{for}, @code{do}, @code{switch}
and @code{return}).

@item
Do not put spaces in front of parentheses used in function calls.

@item
Put spaces around all C operators, and after commas in function calls.

@item
Do not use the comma operator to produce multiple side-effects, except
in @code{for} loop initialization and increment parts, and in macro bodies.

@item
Use real tabs for indenting, not spaces.

@item
Use the ``K&R'' brace layout style.

@item
Use comparisons against @code{NULL} and @code{'\0'} in the conditions of
@code{if}, @code{while} and @code{for} statements, and in the @code{case}s
of @code{switch} statements, instead of just the
plain pointer or character value.

@item
Use the @code{TRUE}, @code{FALSE}, and @code{NULL} symbolic constants,
and the character constant @code{'\0'} where appropriate, instead of @code{1}
and @code{0}.

@item
Provide one-line descriptive comments for each function.

@item
Do not use @samp{#elif}. Many older Unix C compilers cannot handle it.

@item
Do not use the @code{alloca} function for allocating memory off the stack.
Its use causes more portability trouble than the minor benefit of not having
to free the storage. Instead, use @code{malloc} and @code{free}.
@end itemize

If I have to reformat your code to follow the coding style used in
@code{gawk}, I may not bother.

@item
Be prepared to sign the appropriate paperwork.
In order for the FSF to distribute your changes, you must either place
those changes in the public domain, and submit a signed statement to that
effect, or assign the copyright in your changes to the FSF.
Both of these actions are easy to do, and @emph{many} people have done so
already. If you have questions, please contact me
(@pxref{Bugs, , Reporting Problems and Bugs}),
or @code{gnu@@gnu.org}.

@item
Update the documentation.
Along with your new code, please supply new sections and/or chapters
for this @value{DOCUMENT}. If at all possible, please use real
Texinfo, instead of just supplying unformatted ASCII text (although
even that is better than no documentation at all).
Conventions to be followed in @cite{@value{TITLE}} are provided
after the @samp{@@bye} at the end of the Texinfo source file.
If possible, please update the man page as well.

You will also have to sign paperwork for your documentation changes.

@item
Submit changes as context diffs or unified diffs.
Use @samp{diff -c -r -N} or @samp{diff -u -r -N} to compare
the original @code{gawk} source tree with your version.
(I find context diffs to be more readable, but unified diffs are
more compact.)
I recommend using the GNU version of @code{diff}.
Send the output produced by either run of @code{diff} to me when you
submit your changes.
@xref{Bugs, , Reporting Problems and Bugs}, for the electronic mail
information.

Using this format makes it easy for me to apply your changes to the
master version of the @code{gawk} source code (using @code{patch}).
If I have to apply the changes manually, using a text editor, I may
not do so, particularly if there are lots of changes.

@item
Include an entry for the @file{ChangeLog} file with your submission.
This further helps minimize the amount of work I have to do,
making it easier for me to accept patches.
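
As a rough sketch of the last two steps, the patch and its accompanying
@file{ChangeLog} entry might be prepared along the following lines.
The directory names @file{gawk.orig} and @file{gawk.new}, the output file
@file{gawk.diff}, and every name in the sample entry are placeholders
chosen for illustration, not names used by the distribution.

@example
diff -c -r -N gawk.orig gawk.new > gawk.diff
@end example

@example
2000-07-10  Your Name  <you@@example.org>

        * myfile.c (my_function): Say briefly what changed and why.
@end example

@noindent
The entry layout follows the general GNU @file{ChangeLog} convention
(date and author, then one starred item per file); treat it only as a
starting point.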
@end enumerate

Although this sounds like a lot of work, please remember that while you
may write the new code, I have to maintain it and support it, and if it
isn't possible for me to do that with a minimum of extra work, then I
probably will not.


@node New Ports, , Adding Code, Additions
@appendixsubsec Porting @code{gawk} to a New Operating System

@cindex porting @code{gawk}
If you wish to port @code{gawk} to a new operating system, there are
several steps to follow.

@enumerate 1
@item
Follow the guidelines in
@ref{Adding Code, ,Adding New Features},
concerning coding style, submission of diffs, and so on.

@item
When doing a port, bear in mind that your code must co-exist peacefully
with the rest of @code{gawk}, and the other ports. Avoid gratuitous
changes to the system-independent parts of the code. If at all possible,
avoid sprinkling @samp{#ifdef}s just for your port throughout the
code.

If the changes needed for a particular system affect too much of the
code, I probably will not accept them. In such a case, you will, of course,
be able to distribute your changes on your own, as long as you comply
with the GPL
(@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}).

@item
A number of the files that come with @code{gawk} are maintained by other
people at the Free Software Foundation. Thus, you should not change them
unless it is for a very good reason. I.e.@: changes are not out of the
question, but changes to these files will be scrutinized extra carefully.
The files are @file{alloca.c}, @file{getopt.h}, @file{getopt.c},
@file{getopt1.c}, @file{regex.h}, @file{regex.c}, @file{dfa.h},
@file{dfa.c}, @file{install-sh}, and @file{mkinstalldirs}.

@item
Be willing to continue to maintain the port.
Non-Unix operating systems are supported by volunteers who maintain
the code needed to compile and run @code{gawk} on their systems. If no one
volunteers to maintain a port, that port becomes unsupported, and it may
be necessary to remove it from the distribution.

@item
Supply an appropriate @file{gawkmisc.???} file.
Each port has its own @file{gawkmisc.???} that implements certain
operating system specific functions. This is cleaner than a plethora of
@samp{#ifdef}s scattered throughout the code. The @file{gawkmisc.c} in
the main source directory includes the appropriate
@file{gawkmisc.???} file from each subdirectory.
Be sure to update it as well.

Each port's @file{gawkmisc.???} file has a suffix reminiscent of the machine
or operating system for the port. For example, @file{pc/gawkmisc.pc} and
@file{vms/gawkmisc.vms}. The use of separate suffixes, instead of plain
@file{gawkmisc.c}, makes it possible to move files from a port's subdirectory
into the main subdirectory, without accidentally destroying the real
@file{gawkmisc.c} file. (Currently, this is only an issue for the MS-DOS
and OS/2 ports.)

@item
Supply a @file{Makefile} and any other C source and header files that are
necessary for your operating system. All your code should be in a
separate subdirectory, with a name that is the same as, or reminiscent
of, either your operating system or the computer system. If possible,
try to structure things so that it is not necessary to move files out
of the subdirectory into the main source directory. If that is not
possible, then be sure to avoid using names for your files that
duplicate the names of files in the main source directory.

@item
Update the documentation.
Please write a section (or sections) for this @value{DOCUMENT} describing the
installation and compilation steps needed to install and/or compile
@code{gawk} for your system.

@item
Be prepared to sign the appropriate paperwork.
In order for the FSF to distribute your code, you must either place
your code in the public domain, and submit a signed statement to that
effect, or assign the copyright in your code to the FSF.
@ifinfo
Both of these actions are easy to do, and @emph{many} people have done so
already. If you have questions, please contact me, or
@code{gnu@@gnu.org}.
@end ifinfo
@end enumerate

Following these steps will make it much easier to integrate your changes
into @code{gawk}, and have them co-exist happily with the code for other
operating systems that is already there.

In the code that you supply, and that you maintain, feel free to use a
coding style and brace layout that suits your taste.

@node Future Extensions, Improvements, Additions, Notes
@appendixsec Probable Future Extensions
@ignore
From emory!scalpel.netlabs.com!lwall Tue Oct 31 12:43:17 1995
Return-Path: <emory!scalpel.netlabs.com!lwall>
Message-Id: <9510311732.AA28472@scalpel.netlabs.com>
To: arnold@skeeve.atl.ga.us (Arnold D. Robbins)
Subject: Re: May I quote you?
In-Reply-To: Your message of "Tue, 31 Oct 95 09:11:00 EST."
        <m0tAHPQ-00014MC@skeeve.atl.ga.us>
Date: Tue, 31 Oct 95 09:32:46 -0800
From: Larry Wall <emory!scalpel.netlabs.com!lwall>

: Greetings. I am working on the release of gawk 3.0. Part of it will be a
: thoroughly updated manual. One of the sections deals with planned future
: extensions and enhancements.
I have the following at the beginning
: of it:
:
: @cindex PERL
: @cindex Wall, Larry
: @display
: @i{AWK is a language similar to PERL, only considerably more elegant.} @*
: Arnold Robbins
: @sp 1
: @i{Hey!} @*
: Larry Wall
: @end display
:
: Before I actually release this for publication, I wanted to get your
: permission to quote you. (Hopefully, in the spirit of much of GNU, the
: implied humor is visible... :-)

I think that would be fine.

Larry
@end ignore
@cindex PERL
@cindex Wall, Larry
@display
@i{AWK is a language similar to PERL, only considerably more elegant.}
Arnold Robbins

@i{Hey!}
Larry Wall
@end display
@sp 1

This section briefly lists extensions and possible improvements
that indicate the directions we are
currently considering for @code{gawk}. The file @file{FUTURES} in the
@code{gawk} distribution lists these extensions as well.

This is a list of probable future changes that will be usable by the
@code{awk} language programmer.

@c these are ordered by likelihood
@table @asis
@item Localization
The GNU project is starting to support multiple languages.
It will at least be possible to make @code{gawk} print its warnings and
error messages in languages other than English.
It may be possible for @code{awk} programs to also use the multiple
language facilities, separate from @code{gawk} itself.

@item Databases
It may be possible to map a GDBM/NDBM/SDBM file into an @code{awk} array.

@item A @code{PROCINFO} Array
The special files that provide process-related information
(@pxref{Special Files, ,Special File Names in @code{gawk}})
will be superseded by a @code{PROCINFO} array that would provide the same
information, in an easier to access fashion.
19989 19990@item More @code{lint} warnings 19991There are more things that could be checked for portability. 19992 19993@item Control of subprocess environment 19994Changes made in @code{gawk} to the array @code{ENVIRON} may be 19995propagated to subprocesses run by @code{gawk}. 19996 19997@ignore 19998@item @code{RECLEN} variable for fixed length records 19999Along with @code{FIELDWIDTHS}, this would speed up the processing of 20000fixed-length records. 20001 20002@item A @code{restart} keyword 20003After modifying @code{$0}, @code{restart} would restart the pattern 20004matching loop, without reading a new record from the input. 20005 20006@item A @samp{|&} redirection 20007The @samp{|&} redirection, in place of @samp{|}, would open a two-way 20008pipeline for communication with a sub-process (via @code{getline} and 20009@code{print} and @code{printf}). 20010 20011@item Function valued variables 20012It would be possible to assign the name of a user-defined or built-in 20013function to a regular @code{awk} variable, and then call the function 20014indirectly, by using the regular variable. This would make it possible 20015to write general purpose sorting and comparing routines, for example, 20016by simply passing the name of one function into another. 20017 20018@item A built-in @code{stat} function 20019The @code{stat} function would provide an easy-to-use hook to the 20020@code{stat} system call so that @code{awk} programs could determine information 20021about files. 20022 20023@item A built-in @code{ftw} function 20024Combined with function valued variables and the @code{stat} function, 20025@code{ftw} (file tree walk) would make it easy for an @code{awk} program 20026to walk an entire file tree. 20027@end ignore 20028@end table 20029 20030This is a list of probable improvements that will make @code{gawk} 20031perform better. 
20032 20033@table @asis 20034@item An Improved Version of @code{dfa} 20035The @code{dfa} pattern matcher from GNU @code{grep} has some 20036problems. Either a new version or a fixed one will deal with some 20037important regexp matching issues. 20038 20039@item Use of GNU @code{malloc} 20040The GNU version of @code{malloc} could potentially speed up @code{gawk}, 20041since it relies heavily on the use of dynamic memory allocation. 20042 20043@end table 20044 20045@node Improvements, , Future Extensions, Notes 20046@appendixsec Suggestions for Improvements 20047 20048Here are some projects that would-be @code{gawk} hackers might like to take 20049on. They vary in size from a few days to a few weeks of programming, 20050depending on which one you choose and how fast a programmer you are. Please 20051send any improvements you write to the maintainers at the GNU project. 20052@xref{Adding Code, , Adding New Features}, 20053for guidelines to follow when adding new features to @code{gawk}. 20054@xref{Bugs, ,Reporting Problems and Bugs}, for information on 20055contacting the maintainers. 20056 20057@enumerate 20058@item 20059Compilation of @code{awk} programs: @code{gawk} uses a Bison (YACC-like) 20060parser to convert the script given it into a syntax tree; the syntax 20061tree is then executed by a simple recursive evaluator. This method incurs 20062a lot of overhead, since the recursive evaluator performs many procedure 20063calls to do even the simplest things. 20064 20065It should be possible for @code{gawk} to convert the script's parse tree 20066into a C program which the user would then compile, using the normal 20067C compiler and a special @code{gawk} library to provide all the needed 20068functions (regexps, fields, associative arrays, type coercion, and so 20069on). 20070 20071An easier possibility might be for an intermediate phase of @code{awk} to 20072convert the parse tree into a linear byte code form like the one used 20073in GNU Emacs Lisp. 
The recursive evaluator would then be replaced by 20074a straight line byte code interpreter that would be intermediate in speed 20075between running a compiled program and doing what @code{gawk} does 20076now. 20077 20078@item 20079The programs in the test suite could use documenting in this @value{DOCUMENT}. 20080 20081@item 20082See the @file{FUTURES} file for more ideas. Contact us if you would 20083seriously like to tackle any of the items listed there. 20084@end enumerate 20085 20086@node Glossary, Copying, Notes, Top 20087@appendix Glossary 20088 20089@table @asis 20090@item Action 20091A series of @code{awk} statements attached to a rule. If the rule's 20092pattern matches an input record, @code{awk} executes the 20093rule's action. Actions are always enclosed in curly braces. 20094@xref{Action Overview, ,Overview of Actions}. 20095 20096@item Amazing @code{awk} Assembler 20097Henry Spencer at the University of Toronto wrote a retargetable assembler 20098completely as @code{awk} scripts. It is thousands of lines long, including 20099machine descriptions for several eight-bit microcomputers. 20100It is a good example of a 20101program that would have been better written in another language. 20102 20103@item Amazingly Workable Formatter (@code{awf}) 20104Henry Spencer at the University of Toronto wrote a formatter that accepts 20105a large subset of the @samp{nroff -ms} and @samp{nroff -man} formatting 20106commands, using @code{awk} and @code{sh}. 20107 20108@item ANSI 20109The American National Standards Institute. This organization produces 20110many standards, among them the standards for the C and C++ programming 20111languages. 20112 20113@item Assignment 20114An @code{awk} expression that changes the value of some @code{awk} 20115variable or data object. An object that you can assign to is called an 20116@dfn{lvalue}. The assigned values are called @dfn{rvalues}. 20117@xref{Assignment Ops, ,Assignment Expressions}. 

@item @code{awk} Language
The language in which @code{awk} programs are written.

@item @code{awk} Program
An @code{awk} program consists of a series of @dfn{patterns} and
@dfn{actions}, collectively known as @dfn{rules}. For each input record
given to the program, the program's rules are all processed in turn.
@code{awk} programs may also contain function definitions.

@item @code{awk} Script
Another name for an @code{awk} program.

@item Bash
The GNU version of the standard shell (the Bourne-Again shell).
See ``Bourne Shell.''

@item BBS
See ``Bulletin Board System.''

@item Boolean Expression
Named after the English mathematician Boole. See ``Logical Expression.''

@item Bourne Shell
The standard shell (@file{/bin/sh}) on Unix and Unix-like systems,
originally written by Stephen R.@: Bourne.
Many shells (Bash, @code{ksh}, @code{pdksh}, @code{zsh}) are
generally upwardly compatible with the Bourne shell.

@item Built-in Function
The @code{awk} language provides built-in functions that perform various
numerical, time stamp related, and string computations. Examples are
@code{sqrt} (for the square root of a number) and @code{substr} (for a
substring of a string). @xref{Built-in, ,Built-in Functions}.

@item Built-in Variable
@code{ARGC}, @code{ARGIND}, @code{ARGV}, @code{CONVFMT}, @code{ENVIRON},
@code{ERRNO}, @code{FIELDWIDTHS}, @code{FILENAME}, @code{FNR}, @code{FS},
@code{IGNORECASE}, @code{NF}, @code{NR}, @code{OFMT}, @code{OFS}, @code{ORS},
@code{RLENGTH}, @code{RSTART}, @code{RS}, @code{RT}, and @code{SUBSEP}
are the variables that have special meaning to @code{awk}.
Changing some of them affects @code{awk}'s running environment.
Several of these variables are specific to @code{gawk}.
@xref{Built-in Variables}.
20162 20163@item Braces 20164See ``Curly Braces.'' 20165 20166@item Bulletin Board System 20167A computer system allowing users to log in and read and/or leave messages 20168for other users of the system, much like leaving paper notes on a bulletin 20169board. 20170 20171@item C 20172The system programming language that most GNU software is written in. The 20173@code{awk} programming language has C-like syntax, and this @value{DOCUMENT} 20174points out similarities between @code{awk} and C when appropriate. 20175 20176@cindex ISO 8859-1 20177@cindex ISO Latin-1 20178@item Character Set 20179The set of numeric codes used by a computer system to represent the 20180characters (letters, numbers, punctuation, etc.) of a particular country 20181or place. The most common character set in use today is ASCII (American 20182Standard Code for Information Interchange). Many European 20183countries use an extension of ASCII known as ISO-8859-1 (ISO Latin-1). 20184 20185@item CHEM 20186A preprocessor for @code{pic} that reads descriptions of molecules 20187and produces @code{pic} input for drawing them. It was written in @code{awk} 20188by Brian Kernighan and Jon Bentley, and is available from 20189@email{@w{netlib@@research.bell-labs.com}}. 20190 20191@item Compound Statement 20192A series of @code{awk} statements, enclosed in curly braces. Compound 20193statements may be nested. 20194@xref{Statements, ,Control Statements in Actions}. 20195 20196@item Concatenation 20197Concatenating two strings means sticking them together, one after another, 20198giving a new string. For example, the string @samp{foo} concatenated with 20199the string @samp{bar} gives the string @samp{foobar}. 20200@xref{Concatenation, ,String Concatenation}. 20201 20202@item Conditional Expression 20203An expression using the @samp{?:} ternary operator, such as 20204@samp{@var{expr1} ? @var{expr2} : @var{expr3}}. 
The expression 20205@var{expr1} is evaluated; if the result is true, the value of the whole 20206expression is the value of @var{expr2}, otherwise the value is 20207@var{expr3}. In either case, only one of @var{expr2} and @var{expr3} 20208is evaluated. @xref{Conditional Exp, ,Conditional Expressions}. 20209 20210@item Comparison Expression 20211A relation that is either true or false, such as @samp{(a < b)}. 20212Comparison expressions are used in @code{if}, @code{while}, @code{do}, 20213and @code{for} 20214statements, and in patterns to select which input records to process. 20215@xref{Typing and Comparison, ,Variable Typing and Comparison Expressions}. 20216 20217@item Curly Braces 20218The characters @samp{@{} and @samp{@}}. Curly braces are used in 20219@code{awk} for delimiting actions, compound statements, and function 20220bodies. 20221 20222@item Dark Corner 20223An area in the language where specifications often were (or still 20224are) not clear, leading to unexpected or undesirable behavior. 20225Such areas are marked in this @value{DOCUMENT} with ``(d.c.)'' in the 20226text, and are indexed under the heading ``dark corner.'' 20227 20228@item Data Objects 20229These are numbers and strings of characters. Numbers are converted into 20230strings and vice versa, as needed. 20231@xref{Conversion, ,Conversion of Strings and Numbers}. 20232 20233@item Double Precision 20234An internal representation of numbers that can have fractional parts. 20235Double precision numbers keep track of more digits than do single precision 20236numbers, but operations on them are more expensive. This is the way 20237@code{awk} stores numeric values. It is the C type @code{double}. 20238 20239@item Dynamic Regular Expression 20240A dynamic regular expression is a regular expression written as an 20241ordinary expression. It could be a string constant, such as 20242@code{"foo"}, but it may also be an expression whose value can vary. 
@xref{Computed Regexps, , Using Dynamic Regexps}.

@item Environment
A collection of strings, of the form @var{name@code{=}val}, that each
program has available to it. Users generally place values into the
environment in order to provide information to various programs. Typical
examples are the environment variables @code{HOME} and @code{PATH}.

@item Empty String
See ``Null String.''

@item Escape Sequences
A special sequence of characters used for describing non-printing
characters, such as @samp{\n} for newline, or @samp{\033} for the ASCII
ESC (escape) character. @xref{Escape Sequences}.

@item Field
When @code{awk} reads an input record, it splits the record into pieces
separated by whitespace (or by a separator regexp which you can
change by setting the built-in variable @code{FS}). Such pieces are
called fields. If the pieces are of fixed length, you can use the built-in
variable @code{FIELDWIDTHS} to describe their lengths.
@xref{Field Separators, ,Specifying How Fields are Separated},
and @ref{Constant Size, , Reading Fixed-width Data}.

@item Floating Point Number
Often referred to in mathematical terms as a ``rational'' number, this is
just a number that can have a fractional part.
See ``Double Precision'' and ``Single Precision.''

@item Format
Format strings are used to control the appearance of output in the
@code{printf} statement. Also, data conversions from numbers to strings
are controlled by the format string contained in the built-in variable
@code{CONVFMT}. @xref{Control Letters, ,Format-Control Letters}.

@item Function
A specialized group of statements used to encapsulate general
or program-specific tasks. @code{awk} has a number of built-in
functions, and also allows you to define your own.
20284@xref{Built-in, ,Built-in Functions}, 20285and @ref{User-defined, ,User-defined Functions}. 20286 20287@item FSF 20288See ``Free Software Foundation.'' 20289 20290@item Free Software Foundation 20291A non-profit organization dedicated 20292to the production and distribution of freely distributable software. 20293It was founded by Richard M.@: Stallman, the author of the original 20294Emacs editor. GNU Emacs is the most widely used version of Emacs today. 20295 20296@item @code{gawk} 20297The GNU implementation of @code{awk}. 20298 20299@item General Public License 20300This document describes the terms under which @code{gawk} and its source 20301code may be distributed. (@pxref{Copying, ,GNU GENERAL PUBLIC LICENSE}) 20302 20303@item GNU 20304``GNU's not Unix''. An on-going project of the Free Software Foundation 20305to create a complete, freely distributable, POSIX-compliant computing 20306environment. 20307 20308@item GPL 20309See ``General Public License.'' 20310 20311@item Hexadecimal 20312Base 16 notation, where the digits are @code{0}-@code{9} and 20313@code{A}-@code{F}, with @samp{A} 20314representing 10, @samp{B} representing 11, and so on up to @samp{F} for 15. 20315Hexadecimal numbers are written in C using a leading @samp{0x}, 20316to indicate their base. Thus, @code{0x12} is 18 (one times 16 plus 2). 20317 20318@item I/O 20319Abbreviation for ``Input/Output,'' the act of moving data into and/or 20320out of a running program. 20321 20322@item Input Record 20323A single chunk of data read in by @code{awk}. Usually, an @code{awk} input 20324record consists of one line of text. 20325@xref{Records, ,How Input is Split into Records}. 20326 20327@item Integer 20328A whole number, i.e.@: a number that does not have a fractional part. 20329 20330@item Keyword 20331In the @code{awk} language, a keyword is a word that has special 20332meaning. Keywords are reserved and may not be used as variable names. 
20333 20334@code{gawk}'s keywords are: 20335@code{BEGIN}, 20336@code{END}, 20337@code{if}, 20338@code{else}, 20339@code{while}, 20340@code{do@dots{}while}, 20341@code{for}, 20342@code{for@dots{}in}, 20343@code{break}, 20344@code{continue}, 20345@code{delete}, 20346@code{next}, 20347@code{nextfile}, 20348@code{function}, 20349@code{func}, 20350and @code{exit}. 20351 20352@item Logical Expression 20353An expression using the operators for logic, AND, OR, and NOT, written 20354@samp{&&}, @samp{||}, and @samp{!} in @code{awk}. Often called Boolean 20355expressions, after the mathematician who pioneered this kind of 20356mathematical logic. 20357 20358@item Lvalue 20359An expression that can appear on the left side of an assignment 20360operator. In most languages, lvalues can be variables or array 20361elements. In @code{awk}, a field designator can also be used as an 20362lvalue. 20363 20364@item Null String 20365A string with no characters in it. It is represented explicitly in 20366@code{awk} programs by placing two double-quote characters next to 20367each other (@code{""}). It can appear in input data by having two successive 20368occurrences of the field separator appear next to each other. 20369 20370@item Number 20371A numeric valued data object. The @code{gawk} implementation uses double 20372precision floating point to represent numbers. 20373Very old @code{awk} implementations use single precision floating 20374point. 20375 20376@item Octal 20377Base-eight notation, where the digits are @code{0}-@code{7}. 20378Octal numbers are written in C using a leading @samp{0}, 20379to indicate their base. Thus, @code{013} is 11 (one times 8 plus 3). 20380 20381@item Pattern 20382Patterns tell @code{awk} which input records are interesting to which 20383rules. 20384 20385A pattern is an arbitrary conditional expression against which input is 20386tested. If the condition is satisfied, the pattern is said to @dfn{match} 20387the input record. 
A typical pattern might compare the input record against 20388a regular expression. @xref{Pattern Overview, ,Pattern Elements}. 20389 20390@item POSIX 20391The name for a series of standards being developed by the IEEE 20392that specify a Portable Operating System interface. The ``IX'' denotes 20393the Unix heritage of these standards. The main standard of interest for 20394@code{awk} users is 20395@cite{IEEE Standard for Information Technology, Standard 1003.2-1992, 20396Portable Operating System Interface (POSIX) Part 2: Shell and Utilities}. 20397Informally, this standard is often referred to as simply ``P1003.2.'' 20398 20399@item Private 20400Variables and/or functions that are meant for use exclusively by library 20401functions, and not for the main @code{awk} program. Special care must be 20402taken when naming such variables and functions. 20403@xref{Library Names, , Naming Library Function Global Variables}. 20404 20405@item Range (of input lines) 20406A sequence of consecutive lines from the input file. A pattern 20407can specify ranges of input lines for @code{awk} to process, or it can 20408specify single lines. @xref{Pattern Overview, ,Pattern Elements}. 20409 20410@item Recursion 20411When a function calls itself, either directly or indirectly. 20412If this isn't clear, refer to the entry for ``recursion.'' 20413 20414@item Redirection 20415Redirection means performing input from other than the standard input 20416stream, or output to other than the standard output stream. 20417 20418You can redirect the output of the @code{print} and @code{printf} statements 20419to a file or a system command, using the @samp{>}, @samp{>>}, and @samp{|} 20420operators. You can redirect input to the @code{getline} statement using 20421the @samp{<} and @samp{|} operators. 20422@xref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}, 20423and @ref{Getline, ,Explicit Input with @code{getline}}. 

@item Regexp
Short for @dfn{regular expression}. A regexp is a pattern that denotes a
set of strings, possibly an infinite set. For example, the regexp
@samp{R.*xp} matches any string starting with the letter @samp{R}
and ending with the letters @samp{xp}. In @code{awk}, regexps are
used in patterns and in conditional expressions. Regexps may contain
escape sequences. @xref{Regexp, ,Regular Expressions}.

@item Regular Expression
See ``Regexp.''

@item Regular Expression Constant
A regular expression constant is a regular expression written within
slashes, such as @code{/foo/}. This regular expression is chosen
when you write the @code{awk} program, and cannot be changed during
its execution. @xref{Regexp Usage, ,How to Use Regular Expressions}.

@item Rule
A segment of an @code{awk} program that specifies how to process single
input records. A rule consists of a @dfn{pattern} and an @dfn{action}.
@code{awk} reads an input record; then, for each rule, if the input record
satisfies the rule's pattern, @code{awk} executes the rule's action.
Otherwise, the rule does nothing for that input record.

@item Rvalue
A value that can appear on the right side of an assignment operator.
In @code{awk}, essentially every expression has a value. These values
are rvalues.

@item @code{sed}
See ``Stream Editor.''

@item Short-Circuit
The nature of the @code{awk} logical operators @samp{&&} and @samp{||}.
If the value of the entire expression can be deduced from evaluating just
the left-hand side of these operators, the right-hand side will not
be evaluated
(@pxref{Boolean Ops, ,Boolean Expressions}).

@item Side Effect
A side effect occurs when an expression has an effect aside from merely
producing a value.
Assignment expressions, increment and decrement
expressions and function calls have side effects.
@xref{Assignment Ops, ,Assignment Expressions}.

@item Single Precision
An internal representation of numbers that can have fractional parts.
Single precision numbers keep track of fewer digits than do double precision
numbers, but operations on them are less expensive in terms of CPU time.
This is the type used by some very old versions of @code{awk} to store
numeric values. It is the C type @code{float}.

@item Space
The character generated by hitting the space bar on the keyboard.

@item Special File
A file name interpreted internally by @code{gawk}, instead of being handed
directly to the underlying operating system; for example, @file{/dev/stderr}.
@xref{Special Files, ,Special File Names in @code{gawk}}.

@item Stream Editor
A program that reads records from an input stream and processes them one
or more at a time. This is in contrast with batch programs, which may
expect to read their input files in entirety before starting to do
anything, and with interactive programs, which require input from the
user.

@item String
A datum consisting of a sequence of characters, such as @samp{I am a
string}. Constant strings are written with double-quotes in the
@code{awk} language, and may contain escape sequences.
@xref{Escape Sequences}.

@item Tab
The character generated by hitting the @kbd{TAB} key on the keyboard.
It usually expands to up to eight spaces upon output.

@item Unix
A computer operating system originally developed in the early 1970's at
AT&T Bell Laboratories. It initially became popular in universities around
the world, and later moved into commercial environments as a software
development system and network server system.
There are many commercial 20507versions of Unix, as well as several work-alike systems whose source code 20508is freely available (such as Linux, NetBSD, and FreeBSD). 20509 20510@item Whitespace 20511A sequence of space, tab, or newline characters occurring inside an input 20512record or a string. 20513@end table 20514 20515@node Copying, Index, Glossary, Top 20516@unnumbered GNU GENERAL PUBLIC LICENSE 20517@center Version 2, June 1991 20518 20519@display 20520Copyright @copyright{} 1989, 1991 Free Software Foundation, Inc. 2052159 Temple Place --- Suite 330, Boston, MA 02111-1307, USA 20522 20523Everyone is permitted to copy and distribute verbatim copies 20524of this license document, but changing it is not allowed. 20525@end display 20526 20527@c fakenode --- for prepinfo 20528@unnumberedsec Preamble 20529 20530 The licenses for most software are designed to take away your 20531freedom to share and change it. By contrast, the GNU General Public 20532License is intended to guarantee your freedom to share and change free 20533software---to make sure the software is free for all its users. This 20534General Public License applies to most of the Free Software 20535Foundation's software and to any other program whose authors commit to 20536using it. (Some other Free Software Foundation software is covered by 20537the GNU Library General Public License instead.) You can apply it to 20538your programs, too. 20539 20540 When we speak of free software, we are referring to freedom, not 20541price. Our General Public Licenses are designed to make sure that you 20542have the freedom to distribute copies of free software (and charge for 20543this service if you wish), that you receive source code or can get it 20544if you want it, that you can change the software or use pieces of it 20545in new free programs; and that you know you can do these things. 
20546 20547 To protect your rights, we need to make restrictions that forbid 20548anyone to deny you these rights or to ask you to surrender the rights. 20549These restrictions translate to certain responsibilities for you if you 20550distribute copies of the software, or if you modify it. 20551 20552 For example, if you distribute copies of such a program, whether 20553gratis or for a fee, you must give the recipients all the rights that 20554you have. You must make sure that they, too, receive or can get the 20555source code. And you must show them these terms so they know their 20556rights. 20557 20558 We protect your rights with two steps: (1) copyright the software, and 20559(2) offer you this license which gives you legal permission to copy, 20560distribute and/or modify the software. 20561 20562 Also, for each author's protection and ours, we want to make certain 20563that everyone understands that there is no warranty for this free 20564software. If the software is modified by someone else and passed on, we 20565want its recipients to know that what they have is not the original, so 20566that any problems introduced by others will not reflect on the original 20567authors' reputations. 20568 20569 Finally, any free program is threatened constantly by software 20570patents. We wish to avoid the danger that redistributors of a free 20571program will individually obtain patent licenses, in effect making the 20572program proprietary. To prevent this, we have made it clear that any 20573patent must be licensed for everyone's free use or not licensed at all. 20574 20575 The precise terms and conditions for copying, distribution and 20576modification follow. 
20577 20578@iftex 20579@c fakenode --- for prepinfo 20580@unnumberedsec TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 20581@end iftex 20582@ifinfo 20583@center TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 20584@end ifinfo 20585 20586@enumerate 0 20587@item 20588This License applies to any program or other work which contains 20589a notice placed by the copyright holder saying it may be distributed 20590under the terms of this General Public License. The ``Program'', below, 20591refers to any such program or work, and a ``work based on the Program'' 20592means either the Program or any derivative work under copyright law: 20593that is to say, a work containing the Program or a portion of it, 20594either verbatim or with modifications and/or translated into another 20595language. (Hereinafter, translation is included without limitation in 20596the term ``modification''.) Each licensee is addressed as ``you''. 20597 20598Activities other than copying, distribution and modification are not 20599covered by this License; they are outside its scope. The act of 20600running the Program is not restricted, and the output from the Program 20601is covered only if its contents constitute a work based on the 20602Program (independent of having been made by running the Program). 20603Whether that is true depends on what the Program does. 20604 20605@item 20606You may copy and distribute verbatim copies of the Program's 20607source code as you receive it, in any medium, provided that you 20608conspicuously and appropriately publish on each copy an appropriate 20609copyright notice and disclaimer of warranty; keep intact all the 20610notices that refer to this License and to the absence of any warranty; 20611and give any other recipients of the Program a copy of this License 20612along with the Program. 
20613 20614You may charge a fee for the physical act of transferring a copy, and 20615you may at your option offer warranty protection in exchange for a fee. 20616 20617@item 20618You may modify your copy or copies of the Program or any portion 20619of it, thus forming a work based on the Program, and copy and 20620distribute such modifications or work under the terms of Section 1 20621above, provided that you also meet all of these conditions: 20622 20623@enumerate a 20624@item 20625You must cause the modified files to carry prominent notices 20626stating that you changed the files and the date of any change. 20627 20628@item 20629You must cause any work that you distribute or publish, that in 20630whole or in part contains or is derived from the Program or any 20631part thereof, to be licensed as a whole at no charge to all third 20632parties under the terms of this License. 20633 20634@item 20635If the modified program normally reads commands interactively 20636when run, you must cause it, when started running for such 20637interactive use in the most ordinary way, to print or display an 20638announcement including an appropriate copyright notice and a 20639notice that there is no warranty (or else, saying that you provide 20640a warranty) and that users may redistribute the program under 20641these conditions, and telling the user how to view a copy of this 20642License. (Exception: if the Program itself is interactive but 20643does not normally print such an announcement, your work based on 20644the Program is not required to print an announcement.) 20645@end enumerate 20646 20647These requirements apply to the modified work as a whole. If 20648identifiable sections of that work are not derived from the Program, 20649and can be reasonably considered independent and separate works in 20650themselves, then this License, and its terms, do not apply to those 20651sections when you distribute them as separate works. 
But when you 20652distribute the same sections as part of a whole which is a work based 20653on the Program, the distribution of the whole must be on the terms of 20654this License, whose permissions for other licensees extend to the 20655entire whole, and thus to each and every part regardless of who wrote it. 20656 20657Thus, it is not the intent of this section to claim rights or contest 20658your rights to work written entirely by you; rather, the intent is to 20659exercise the right to control the distribution of derivative or 20660collective works based on the Program. 20661 20662In addition, mere aggregation of another work not based on the Program 20663with the Program (or with a work based on the Program) on a volume of 20664a storage or distribution medium does not bring the other work under 20665the scope of this License. 20666 20667@item 20668You may copy and distribute the Program (or a work based on it, 20669under Section 2) in object code or executable form under the terms of 20670Sections 1 and 2 above provided that you also do one of the following: 20671 20672@enumerate a 20673@item 20674Accompany it with the complete corresponding machine-readable 20675source code, which must be distributed under the terms of Sections 206761 and 2 above on a medium customarily used for software interchange; or, 20677 20678@item 20679Accompany it with a written offer, valid for at least three 20680years, to give any third party, for a charge no more than your 20681cost of physically performing source distribution, a complete 20682machine-readable copy of the corresponding source code, to be 20683distributed under the terms of Sections 1 and 2 above on a medium 20684customarily used for software interchange; or, 20685 20686@item 20687Accompany it with the information you received as to the offer 20688to distribute corresponding source code. 
(This alternative is
allowed only for non-commercial distribution and only if you
received the program in object code or executable form with such
an offer, in accord with Subsection b above.)
@end enumerate

The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
anything that is normally distributed (in either source or binary
form) with the major components (compiler, kernel, and so on) of the
operating system on which the executable runs, unless that component
itself accompanies the executable.

If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.

@item
You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.

@item
You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works.
These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.

@item
Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.

@item
If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
license would not permit royalty-free redistribution of the Program by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Program.

If any portion of this section is held invalid or unenforceable under
any particular circumstance, the balance of the section is intended to
apply and the section as a whole is intended to apply in other
circumstances.

It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.

This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.

@item
If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.

@item
The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.

Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and ``any
later version'', you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation.
If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.

@item
If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.

@iftex
@c fakenode --- for prepinfo
@heading NO WARRANTY
@end iftex
@ifinfo
@center NO WARRANTY
@end ifinfo

@item
BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW@. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM ``AS IS'' WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE@. THE ENTIRE RISK AS
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU@. SHOULD THE
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
REPAIR OR CORRECTION.

@item
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
@end enumerate

@iftex
@c fakenode --- for prepinfo
@heading END OF TERMS AND CONDITIONS
@end iftex
@ifinfo
@center END OF TERMS AND CONDITIONS
@end ifinfo

@page
@c fakenode --- for prepinfo
@unnumberedsec How to Apply These Terms to Your New Programs

 If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.

 To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least
the ``copyright'' line and a pointer to where the full notice is found.

@smallexample
@var{one line to give the program's name and an idea of what it does.}
Copyright (C) @var{year} @var{name of author}

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE@. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place --- Suite 330, Boston, MA 02111-1307, USA.
@end smallexample

Also add information on how to contact you by electronic and paper mail.

If the program is interactive, make it output a short notice like this
when it starts in an interactive mode:

@smallexample
Gnomovision version 69, Copyright (C) @var{year} @var{name of author}
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details
type `show w'. This is free software, and you are welcome
to redistribute it under certain conditions; type `show c'
for details.
@end smallexample

The hypothetical commands @samp{show w} and @samp{show c} should show
the appropriate parts of the General Public License. Of course, the
commands you use may be called something other than @samp{show w} and
@samp{show c}; they could even be mouse-clicks or menu items---whatever
suits your program.

You should also get your employer (if you work as a programmer) or your
school, if any, to sign a ``copyright disclaimer'' for the program, if
necessary. Here is a sample; alter the names:

@smallexample
@group
Yoyodyne, Inc., hereby disclaims all copyright
interest in the program `Gnomovision'
(which makes passes at compilers) written
by James Hacker.

@var{signature of Ty Coon}, 1 April 1989
Ty Coon, President of Vice
@end group
@end smallexample

This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Library General
Public License instead of this License.

@node Index, , Copying, Top
@unnumbered Index
@printindex cp

@summarycontents
@contents
@bye

Unresolved Issues:
------------------
1. From ADR.

   Robert J. Chassell points out that awk programs should have some indication
   of how to use them. It would be useful to perhaps have a "programming
   style" section of the manual that would include this and other tips.

2. The default AWKPATH search path should be configurable via `configure'.
   The default and how this changes needs to be documented.

Consistency issues:
  /.../ regexps are in @code, not @samp
  ".." strings are in @code, not @samp
  no @print before @dots
  values of expressions in the text (@code{x} has the value 15),
    should be in roman, not @code
  Use tab and not TAB
  Use ESC and not ESCAPE
  Use space and not blank to describe the space bar's character
  The term "blank" is thus basically reserved for "blank lines" etc.
  The `(d.c.)' should appear inside the closing `.'
    of a sentence
    It should come before (pxref{...})
  " " should have an @w{} around it
  Use "non-" everywhere
  Use @code{ftp} when talking about anonymous ftp
  Use upper-case and lower-case, not "upper case" and "lower case"
  Use alphanumeric, not alpha-numeric
  Use --foo, not -Wfoo when describing long options
  Use findex for all programs and functions in the example chapters
  Use "Bell Laboratories", but not "Bell Labs".
  Use "behavior" instead of "behaviour".
  Use "zeros" instead of "zeroes".
  Use "Input/Output", not "input/output". Also "I/O", not "i/o".
  Use @code{do}, and not @code{do}-@code{while}, except where
    actually discussing the do-while.
  The words "a", "and", "as", "between", "for", "from", "in", "of",
    "on", "that", "the", "to", "with", and "without",
    should not be capitalized in @chapter, @section etc.
    "Into" and "How" should.
  Search for @dfn; make sure important items are also indexed.
  "e.g." should always be followed by a comma.
  "i.e." should never be followed by a comma, and should be followed
    by `@:'.
  The numbers zero through ten should be spelled out, except when
    talking about file descriptor numbers. > 10 and < 0, it's
    ok to use numbers.
  In tables, put command line options in @code, while in the text,
    put them in @samp.
  When using @strong, use "Note:" or "Caution:" with colons and
    not exclamation points. Do not surround the paragraphs
    with @quotation ... @end quotation.

Date: Wed, 13 Apr 94 15:20:52 -0400
From: rms@gnu.ai.mit.edu (Richard Stallman)
To: gnu-prog@gnu.ai.mit.edu
Subject: A reminder: no pathnames in GNU

It's a GNU convention to use the term "file name" for the name of a
file, never "pathname". We use the term "path" for search paths,
which are lists of file names.
Using it for a single file name as
well is potentially confusing to users.

So please check any documentation you maintain, if you think you might
have used "pathname".

Note that "file name" should be two words when it appears as ordinary
text. It's ok as one word when it's a metasyntactic variable, though.

Suggestions:
------------
Enhance FIELDWIDTHS with some way to indicate "the rest of the record".
E.g., a length of 0 or -1 or something. Maybe "n"?

Make FIELDWIDTHS be an array?

What if FIELDWIDTHS has invalid values in it?