%% /u/sy/beebe/tex/bibsort/README.AWK, Sat Nov 9 15:35:00 1996
%% Edit by Nelson H. F. Beebe <beebe@plot79.math.utah.edu>

These notes provide some information about awk, in case you are
unfamiliar with it, or want to learn more about it.

I use the awk programming language for implementing many of my
software tools (I have written more than 114,000 lines of awk code as
of [09-Nov-1996]), and I use it in teaching as an example of a little
language that every computer user who does text processing can benefit
from learning.

While awk is an interpreted language that suffers a runtime
performance penalty compared to natively compiled languages such as
Ada, C, C++, Fortran, and Pascal, for many text processing problems it
is almost perfect. A C implementation of my bibcheck utility ran 3.5
times faster than the awk version, but took 22.4 times as many lines
of code! And, of course, the awk version was much easier to write,
and required very little debugging.

awk is a POSIX standard, though I don't yet have on hand the POSIX awk
language description. This means that you can expect your computer
vendor to provide it, and that it should be widely available for a
long time.

awk is a clean, simple language, with few blemishes. This is in stark
contrast to perl, which I find so ugly that I refuse to learn it, even
though I deeply appreciate what it is trying to do.

The official description of awk is found in the book

@String{pub-AW = "Ad{\-d}i{\-s}on-Wes{\-l}ey"}
@String{pub-AW:adr = "Reading, MA, USA"}

@Book{Aho:1987:APL,
  author = "Alfred V. Aho and Brian W. Kernighan and Peter J.
            Weinberger",
  title = "The {AWK} Programming Language",
  publisher = pub-AW,
  address = pub-AW:adr,
  pages = "x + 210",
  year = "1988",
  ISBN = "0-201-07981-X",
  LCCN = "QA76.73.A95 A35 1988",
  bibdate = "Tue Dec 14 22:33:46 1993",
}

Another book which you may find useful (though I much prefer the above
one) is

@String{pub-ORA = "O'Reilly \& {Associates, Inc.}"}
@String{pub-ORA:adr = "981 Chestnut Street, Newton, MA 02164, USA"}

@Book{Dougherty:SA91,
  author = "Dale Dougherty",
  title = "sed {\&} awk",
  publisher = pub-ORA,
  address = pub-ORA:adr,
  pages = "xxii + 394",
  year = "1991",
  ISBN = "0-937175-59-5",
  LCCN = "QA76.76.U84 D69 1991",
}

There is also a recent one (based on the GNU awk implementation) that
I have not yet seen:

@String{pub-SSC = "Specialized Systems Consultants"}
@String{pub-SSC:adr = "P.O. Box 55549, Seattle, WA 98155"}

@Book{Robbins:1996:EAP,
  author = "Arnold Robbins",
  title = "Effective {AWK} Programming",
  publisher = pub-SSC,
  address = pub-SSC:adr,
  year = "1996",
  URL = "http://www.ssc.com/ssc/eap/",
  ISBN = "0-916151-88-3",
  LCCN = "",
  acknowledgement = ack-nhfb,
  pages = "321",
  price = "US\$27.00",
  bibdate = "Fri Jun 14 17:24:04 1996",
  libnote = "Not yet in my library.",
}

Some other publications on, and suppliers of, awk are:

@String{j-SUNEXPERT = "SunExpert"}

@Article{Collinson:awk,
  author = "Peter Collinson",
  title = "Awk",
  journal = j-SUNEXPERT,
  volume = "2",
  number = "1",
  pages = "33--36",
  month = jan,
  year = "1991",
}

@String{pub-FSF = "{Free Software Foundation}"}
@String{pub-FSF:adr = "675 Mass Ave, Cambridge, MA 02139,
                       USA, Tel: (617) 876-3296"}

@Misc{FSF:gawk,
  key = "GAWK",
  title = "The {GAWK} Manual",
  howpublished = pub-FSF # " " # pub-FSF:adr,
  year = "1987",
  note = "Also available via ANONYMOUS FTP to
          \path|prep.ai.mit.edu|. See also \cite{Aho:1987:APL}.",
}

@Misc{MKS:awk,
  author = "{Mortice Kern Systems, Inc.}",
  title = "{MKSAWK}",
  year = "1987",
  note = "35 King Street North, Waterloo, Ontario, Canada, Tel:
          (519) 884-2251. See also \cite{Aho:1987:APL}.",
}

@Misc{ONW:awk,
  author = "{OpenNetwork}",
  title = "{The Berkeley Utilities}",
  year = "1991",
  note = "215 Berkeley Place, Brooklyn, NY 11217, USA, Tel:
          (718) 398-3838.",
  altnote = "See ad on p. 108 of April 1991 UNIX Review.",
}

@Misc{Polytron:polyawk,
  author = "Polytron Corporation",
  title = "{Poly{\-}AWK}",
  year = "1987",
  note = "170 NW 167th Place, Beaverton, OR 97006. See also
          \cite{Aho:1987:APL}.",
}

@String{j-SPE = "Soft{\-}ware\emdash Prac{\-}tice
                 and Experience"}

@Article{VanWyk:awk,
  author = "Christopher J. Van Wyk",
  title = "{AWK} as Glue for Programs",
  journal = j-SPE,
  volume = "16",
  number = "4",
  pages = "369--388",
  month = apr,
  year = "1986",
}

These entries are all taken from

        ftp://ftp.math.utah.edu/pub/tex/bib/index.html#master

which records books in my library, and other selected references; by
the time you read this, there may be more awk-related entries in that
bibliography.

At the time that the Aho, Kernighan, and Weinberger book appeared, awk
was available only on UNIX systems, and it took a few years for UNIX
vendors to incorporate the new, and much enhanced, version of the
language described in the book. Most UNIX vendors retain the name
`awk' for the original function-less language from 1978, and call
the 1987 one `nawk' (for `new awk'). An important exception is IBM,
which supplies the new implementation on RS/6000 AIX systems, but
calls it just awk.
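
Because the name alone does not reveal which language a given `awk'
implements, a script can probe for it: the 1978 language has no
user-defined functions, so a one-line program that defines one
separates old from new. The following is only a sketch; the shell
function name find_new_awk and the list of candidate command names
are assumptions made for illustration.

```sh
# Sketch: report the first awk on PATH that accepts user-defined
# functions (present in the new language, absent from the 1978 one).
# The name find_new_awk and the candidate list are hypothetical.
find_new_awk() {
    for candidate in nawk gawk mawk awk
    do
        # An old awk, or a missing command, fails here, so only a
        # new-style implementation produces the expected answer.
        if result=$($candidate 'function twice(x) { return 2 * x }
                                BEGIN { print twice(21) }' \
                        2>/dev/null </dev/null) &&
           [ "$result" = "42" ]
        then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

AWK=$(find_new_awk) && echo "using $AWK"
```

A script that stores the result in a variable such as AWK can then
invoke "$AWK" everywhere instead of hard-coding one command name.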

Unfortunately, several vendors have not kept up with Brian Kernighan's
further development of awk, with the result that some nawk
implementations lack features that were added after the 1987 book was
published, notably the ENVIRON[] array for access to environment
variables. Also, some of the vendor implementations have not
incorporated bug fixes that Kernighan introduced.
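
A minimal sketch of environment access with and without ENVIRON[]
follows. The variable name BIBINPUTS, its value, and the two helper
function names are purely illustrative, and the fallback assumes that
the awk at hand accepts var=value assignments among its command-line
operands.

```sh
# Sketch: two ways to get an environment variable into an awk program.
# BIBINPUTS, env_via_environ, and env_via_assignment are hypothetical
# names chosen for this example.
BIBINPUTS=/tmp/bib
export BIBINPUTS

# Direct lookup, for implementations that provide ENVIRON[].
env_via_environ() {
    awk 'BEGIN { print ENVIRON["BIBINPUTS"] }' </dev/null
}

# Fallback: the shell expands the variable and hands it to awk as an
# ordinary variable assignment.  Such assignments take effect only
# once input processing starts, so the value is used in a main rule
# (not in BEGIN) and one empty input line is supplied.
env_via_assignment() {
    echo "" | awk '{ print dir }' dir="$BIBINPUTS"
}

env_via_environ
env_via_assignment
```

Both calls should print the same directory name when ENVIRON[] is
available; where it is not, only the second form can be expected to
work.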

Fortunately, this situation has improved through three important
developments:

    (1) Arnold Robbins's gawk, the GNU Project implementation of
    awk, available at
        ftp://prep.ai.mit.edu/pub/gnu/gawk-x.yy.tar.gz
    The gawk distribution includes ports for the Amiga, the IBM
    PC, the Atari, and DEC OpenVMS.

    (2) Brian Kernighan's awk has been released by AT&T Bell Labs,
    and is available at
        http://cm.bell-labs.com/who/bwk/awk.sh

    (3) Mike Brennan's mawk, available at
        ftp://ftp.whidbey.net/pub/brennan/mawkx.y.z.tar.gz

Besides these implementations, which are freely distributable (for
non-commercial purposes, as detailed in the licenses included with
their distributions), there are commercially supported versions of awk
for the IBM PC world and other machines, recorded in the BibTeX
entries above.

Robbins, Kernighan, and Brennan are in contact with one another, so
their implementations support the same features, although gawk has
added a number of (well-documented) extensions that the others have
not yet incorporated. With a few exceptions, I've tried hard in my
awk programs to stick to the standard language as documented in the
1987 book. Notable implementation differences are:

    (1) gawk and recent AT&T awk have the IGNORECASE extension,
    which I only rarely use. That feature is difficult to
    simulate in a portable awk program.

    (2) gawk and mawk have toupper() and tolower() for efficient
    lettercase conversion; it is possible to implement these in
    awk itself, but only very inefficiently.

    (3) gawk, mawk, and recent AT&T awk support the ENVIRON[]
    array for efficient access to environment variables.

    (4) gawk, mawk, and recent AT&T awk support the names
    /dev/stderr and /dev/stdout for the standard UNIX devices
    (which, sadly, UNIX never got around to naming), and gawk
    and AT&T awk support /dev/stdin. The alternatives to
    /dev/stderr are /dev/tty (which fails if the process is
    running without a controlling terminal, as happens for
    batch jobs and for background processes) or a horrid
    contortion that invokes the shell and cat. Since any realistic
    program will require the ability to write error messages, use
    of /dev/stderr is the one feature that is likely to cause
    portability problems.

    (5) Only gawk is 8-bit clean: it can process all 256 8-bit
    byte values, including NUL, and it accepts 8-bit characters
    in regexp patterns. Recent AT&T awk loses NUL
    during processing (because it uses C-style strings internally,
    which reserve NUL for a string terminator), and it rejects
    characters 128..255 in regexp patterns. mawk gets thoroughly
    confused by NUL in its input stream, and terminates; it
    handles the other 255 byte values (1..255) correctly, at least
    for I/O.
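
Items (2) and (4) can be illustrated with a short sketch of the
portable workarounds: a character-by-character uppercase conversion
written in awk itself, and the shell-and-cat contortion for reaching
standard error. The names upcase_portably, my_toupper, and warn are
invented for this example, and the code is meant as an illustration
of the idea rather than a tuned implementation.

```sh
# Sketch: portable substitutes for toupper() and /dev/stderr.
# upcase_portably, my_toupper, and warn are hypothetical names.
upcase_portably() {
    awk '
    # Slow but portable: look each character up in the lowercase
    # alphabet and substitute the letter at the same position in the
    # uppercase one.  Extra parameters serve as local variables.
    function my_toupper(s,    i, n, c, k, out) {
        out = ""
        n = length(s)
        for (i = 1; i <= n; i++) {
            c = substr(s, i, 1)
            k = index("abcdefghijklmnopqrstuvwxyz", c)
            if (k > 0)
                c = substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ", k, 1)
            out = out c
        }
        return out
    }

    # Without /dev/stderr, pipe the message to a shell command that
    # copies its input to standard error.
    function warn(msg) {
        print msg | "cat 1>&2"
    }

    { print my_toupper($0) }
    /^$/ { warn("warning: empty line " NR) }
    END { close("cat 1>&2") }
    '
}

printf 'Hello\n\nworld\n' | upcase_portably
```

The conversion makes one index() and one or two substr() calls per
input character, which is the inefficiency that item (2) refers to;
where toupper() exists, it is the better choice.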

For further discussion of awk implementation differences and language
evolution, see the
        (gawk.info)Language History
node in the GNU Emacs info system.
