awk Functions
This chapter presents a library of useful awk functions. The sample programs presented later (see section Practical awk Programs) use these functions. The functions are presented here in a progression from simple to complex.
Section Extracting Programs from Texinfo Source Files, presents a program that you can use to extract the source code for these example library functions and programs from the Texinfo source for this book. (This has already been done as part of the gawk distribution.)
If you have written one or more useful, general purpose awk functions, and would like to contribute them for a subsequent edition of this book, please contact the author. See section Reporting Problems and Bugs, for information on doing this. Don't just send code, as you will be required to either place your code in the public domain, publish it under the GPL (see section GNU GENERAL PUBLIC LICENSE), or assign the copyright in it to the Free Software Foundation.
gawk-specific Features
The programs in this chapter and in section Practical awk Programs, freely use features that are specific to gawk. This section briefly discusses how you can rewrite these programs for different implementations of awk.
Diagnostic error messages are sent to `/dev/stderr'. Use `| "cat 1>&2"' instead of `> "/dev/stderr"', if your system does not have a `/dev/stderr', or if you cannot use gawk.
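As a minimal sketch of that substitution, the hypothetical helper `warn' below (it is not one of the library functions in this chapter) sends its message through `cat 1>&2', so it does not depend on `/dev/stderr'. The counter _warn_count is also an assumption, added only so that the effect can be checked.

```awk
# warn -- print a diagnostic on standard error without using /dev/stderr
# (hypothetical helper for illustration; not one of the library functions)
function warn(msg)
{
    print "warning: " msg | "cat 1>&2"
    return ++_warn_count        # number of warnings issued so far
}

BEGIN {
    warn("first problem")
    warn("second problem")
    close("cat 1>&2")           # finish the pipe so the messages are flushed
}
```

The string given to close must be exactly the same command string that was used with print.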
A number of programs use nextfile (see section The nextfile Statement), to skip any remaining input in the input file. Section Implementing nextfile as a Function, shows you how to write a function that will do the same thing.
Finally, some of the programs choose to ignore upper-case and lower-case distinctions in their input. They do this by assigning one to IGNORECASE. You can achieve the same effect by adding the following rule to the beginning of the program:
    # ignore case
    { $0 = tolower($0) }
Also, verify that all regexp and string constants used in comparisons only use lower-case letters.
Implementing nextfile as a Function
The nextfile statement presented in section The nextfile Statement, is a gawk-specific extension. It is not available in other implementations of awk. This section shows two versions of a nextfile function that you can use to simulate gawk's nextfile statement if you cannot use gawk.
Here is a first attempt at writing a nextfile function.
    # nextfile -- skip remaining records in current file

    # this should be read in before the "main" awk program

    function nextfile()    { _abandon_ = FILENAME; next }

    _abandon_ == FILENAME  { next }
This file should be included before the main program, because it supplies a rule that must be executed first. This rule compares the current data file's name (which is always in the FILENAME variable) to a private variable named _abandon_. If the file name matches, then the action part of the rule executes a next statement, to go on to the next record. (The use of `_' in the variable name is a convention. It is discussed more fully in section Naming Library Function Global Variables.)
The use of the next statement effectively creates a loop that reads all the records from the current data file. Eventually, the end of the file is reached, and a new data file is opened, changing the value of FILENAME. Once this happens, the comparison of _abandon_ to FILENAME fails, and execution continues with the first rule of the "real" program.
The nextfile function itself simply sets the value of _abandon_ and then executes a next statement to start the loop going.(16)
This initial version has a subtle problem. What happens if the same data file is listed twice on the command line, one right after the other, or even with just a variable assignment between the two occurrences of the file name? In such a case, this code will skip right through the file a second time, even though it should stop when it gets to the end of the first occurrence.
Here is a second version of nextfile that remedies this problem.
    # nextfile -- skip remaining records in current file
    # correctly handle successive occurrences of the same file
    # Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
    # May, 1993

    # this should be read in before the "main" awk program

    function nextfile()   { _abandon_ = FILENAME; next }

    _abandon_ == FILENAME {
          if (FNR == 1)
              _abandon_ = ""
          else
              next
    }
The nextfile function has not changed. It sets _abandon_ equal to the current file name and then executes a next statement. The next statement reads the next record and increments FNR, so FNR is guaranteed to have a value of at least two. However, if nextfile is called for the last record in the file, then awk will close the current data file and move on to the next one. Upon doing so, FILENAME will be set to the name of the new file, and FNR will be reset to one. If this next file is the same as the previous one, _abandon_ will still be equal to FILENAME. However, FNR will be equal to one, telling us that this is a new occurrence of the file, and not the one we were reading when the nextfile function was executed. In that case, _abandon_ is reset to the empty string, so that further executions of this rule will fail (until the next time that nextfile is called).
If FNR is not one, then we are still in the original data file, and the program executes a next statement to skip through it.
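The effect of the FNR test can be traced without any data files. The sketch below is only a simulation: the hypothetical function skip_record applies the same test as the rule in the second version to a hand-written sequence of record numbers, as if a file named `data' with three records had been listed twice in a row and nextfile had been called on its first record.

```awk
# Simulation only: mimics the second version's rule on a hand-written
# sequence of record numbers instead of real input files.
function skip_record(filename, recno)
{
    if (_abandon_ != filename)
        return 0            # the rule does not match; record is processed
    if (recno == 1) {       # a new occurrence of the same file
        _abandon_ = ""
        return 0
    }
    return 1                # still inside the abandoned file
}

BEGIN {
    # record numbers seen when "data" (three records) is listed twice
    n = split("1 2 3 1 2 3", seq, " ")
    _abandon_ = "data"      # as if nextfile() had been called on record 1
    for (i = 2; i <= n; i++)
        trace = trace (skip_record("data", seq[i]) ? "s" : "p")
    print trace             # s = skipped, p = processed
}
```

This prints `ssppp': the rest of the first occurrence is skipped, and the second occurrence is processed from its first record. With the first version's test (no check of the record number), all five records would be skipped.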
An important question to ask at this point is: "Given that the functionality of nextfile can be provided with a library file, why is it built into gawk?" Adding features for little reason leads to larger, slower programs that are harder to maintain.
The answer is that building nextfile into gawk provides significant gains in efficiency. If the nextfile function is executed at the beginning of a large data file, awk still has to scan the entire file, splitting it up into records, just to skip over it. The built-in nextfile can simply close the file immediately and proceed to the next one, saving a lot of time. This is particularly important in awk, since awk programs are generally I/O bound (i.e., they spend most of their time doing input and output, instead of performing computations).
When writing large programs, it is often useful to be able to know that a condition or set of conditions is true. Before proceeding with a particular computation, you make a statement about what you believe to be the case. Such a statement is known as an "assertion." The C language provides an <assert.h> header file and corresponding assert macro that the programmer can use to make assertions. If an assertion fails, the assert macro arranges to print a diagnostic message describing the condition that should have been true but was not, and then it kills the program. In C, using assert looks like this:
    #include <assert.h>

    int myfunc(int a, double b)
    {
         assert(a <= 5 && b >= 17);
         ...
    }
If the assertion failed, the program would print a message similar to this:
    prog.c:5: assertion failed: a <= 5 && b >= 17
The ANSI C language makes it possible to turn the condition into a string for use in printing the diagnostic message. This is not possible in awk, so this assert function also requires a string version of the condition that is being tested.
    # assert -- assert that a condition is true. Otherwise exit.
    # Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
    # May, 1993

    function assert(condition, string)
    {
        if (! condition) {
            printf("%s:%d: assertion failed: %s\n",
                FILENAME, FNR, string) > "/dev/stderr"
            _assert_exit = 1
            exit 1
        }
    }

    END {
        if (_assert_exit)
            exit 1
    }
The assert function tests the condition parameter. If it is false, it prints a message to standard error, using the string parameter to describe the failed condition. It then sets the variable _assert_exit to one, and executes the exit statement. The exit statement jumps to the END rule. If the END rule finds _assert_exit to be true, then it exits immediately.
The purpose of the END rule with its test is to keep any other END rules from running. When an assertion fails, the program should exit immediately. If no assertions fail, then _assert_exit will still be false when the END rule is run normally, and the rest of the program's END rules will execute. For all of this to work correctly, `assert.awk' must be the first source file read by awk.
You would use this function in your programs this way:
    function myfunc(a, b)
    {
         assert(a <= 5 && b >= 17, "a <= 5 && b >= 17")
         ...
    }
If the assertion failed, you would see a message like this:
    mydata:1357: assertion failed: a <= 5 && b >= 17
There is a problem with this version of assert that it may not be possible to work around. An END rule is automatically added to the program calling assert. Normally, if a program consists of just a BEGIN rule, the input files and/or standard input are not read. However, now that the program has an END rule, awk will attempt to read the input data files, or standard input (see section Startup and Cleanup Actions), most likely causing the program to hang, waiting for input.
Just a note on programming style. You may have noticed that the END rule uses backslash continuation, with the open brace on a line by itself. This is so that it more closely resembles the way functions are written. Many of the examples in this chapter and the next one use this style. You can decide for yourself if you like writing your BEGIN and END rules this way, or not.
One commercial implementation of awk supplies a built-in function, ord, which takes a character and returns the numeric value for that character in the machine's character set. If the string passed to ord has more than one character, only the first one is used. The inverse of this function is chr (from the function of the same name in Pascal), which takes a number and returns the corresponding character.
Both functions can be written very nicely in awk; there is no real reason to build them into the awk interpreter.
    # ord.awk -- do ord and chr
    #
    # Global identifiers:
    #    _ord_:        numerical values indexed by characters
    #    _ord_init:    function to initialize _ord_
    #
    # Arnold Robbins
    # arnold@gnu.ai.mit.edu
    # Public Domain
    # 16 January, 1992
    # 20 July, 1992, revised

    BEGIN    { _ord_init() }

    function _ord_init(    low, high, i, t)
    {
        low = sprintf("%c", 7) # BEL is ascii 7
        if (low == "\a") {    # regular ascii
            low = 0
            high = 127
        } else if (sprintf("%c", 128 + 7) == "\a") {
            # ascii, mark parity
            low = 128
            high = 255
        } else {        # ebcdic(!)
            low = 0
            high = 255
        }

        for (i = low; i <= high; i++) {
            t = sprintf("%c", i)
            _ord_[t] = i
        }
    }
Some explanation of the numbers used by chr is worthwhile. The most prominent character set in use today is ASCII. Although an eight-bit byte can hold 256 distinct values (from zero to 255), ASCII only defines characters that use the values from zero to 127.(17) At least one computer manufacturer that we know of uses ASCII, but with mark parity, meaning that the leftmost bit in the byte is always one. What this means is that on those systems, characters have numeric values from 128 to 255. Finally, large mainframe systems use the EBCDIC character set, which uses all 256 values. While there are other character sets in use on some older systems, they are not really worth worrying about.
    function ord(str,    c)
    {
        # only first character is of interest
        c = substr(str, 1, 1)
        return _ord_[c]
    }

    function chr(c)
    {
        # force c to be numeric by adding 0
        return sprintf("%c", c + 0)
    }

    #### test code ####
    # BEGIN    \
    # {
    #    for (;;) {
    #        printf("enter a character: ")
    #        if (getline var <= 0)
    #            break
    #        printf("ord(%s) = %d\n", var, ord(var))
    #    }
    # }
An obvious improvement to these functions would be to move the code for the _ord_init function into the body of the BEGIN rule. It was written this way initially for ease of development.
There is a "test program" in a BEGIN rule, for testing the function. It is commented out for production use.
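For a quick, non-interactive check of the idea, here is a compact sketch. It assumes a plain ASCII system, so it fills the table for the values one through 127 only, and it uses the hypothetical names demo_ord_init, demo_ord, and demo_chr so that it does not clash with the library file.

```awk
# Compact stand-ins for ord and chr, assuming a plain ASCII system.
function demo_ord_init(    i)
{
    for (i = 1; i <= 127; i++)
        _demo_ord[sprintf("%c", i)] = i
}

function demo_ord(str)
{
    return _demo_ord[substr(str, 1, 1)]    # only the first character counts
}

function demo_chr(n)
{
    return sprintf("%c", n + 0)            # adding 0 forces a numeric value
}

BEGIN {
    demo_ord_init()
    printf "%d %s %s\n", demo_ord("Apple"), demo_chr(97),
           demo_chr(demo_ord("A") + 1)
}
```

On an ASCII system this prints `65 a B': only the `A' of `Apple' is looked at, 97 is the value of `a', and the character after `A' is `B'.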
When doing string processing, it is often useful to be able to join all the strings in an array into one long string. The following function, join, accomplishes this task. It is used later in several of the application programs (see section Practical awk Programs).
Good function design is important; this function needs to be general, but it should also have a reasonable default behavior. It is called with an array and the beginning and ending indices of the elements in the array to be merged. This assumes that the array indices are numeric--a reasonable assumption since the array was likely created with split (see section Built-in Functions for String Manipulation).
    # join.awk -- join an array into a string
    # Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
    # May 1993

    function join(array, start, end, sep,    result, i)
    {
        if (sep == "")
           sep = " "
        else if (sep == SUBSEP) # magic value
           sep = ""
        result = array[start]
        for (i = start + 1; i <= end; i++)
            result = result sep array[i]
        return result
    }
An optional additional argument is the separator to use when joining the strings back together. If the caller supplies a non-empty value, join uses it. If it is not supplied, it will have a null value. In this case, join uses a single blank as a default separator for the strings. If the value is equal to SUBSEP, then join joins the strings with no separator between them. SUBSEP serves as a "magic" value to indicate that there should be no separation between the component strings.
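The three cases for the separator can be seen side by side in a short sketch. The join function is repeated from above so that the example stands alone; the sample string and the array name parts are made up for the illustration.

```awk
# copy of the join function shown above, so the example stands alone
function join(array, start, end, sep,    result, i)
{
    if (sep == "")
        sep = " "
    else if (sep == SUBSEP)     # magic value
        sep = ""
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result sep array[i]
    return result
}

BEGIN {
    n = split("usr local bin", parts, " ")
    print join(parts, 1, n)             # no separator given: single blank
    print join(parts, 1, n, "/")        # caller-supplied separator
    print join(parts, 2, n, SUBSEP)     # magic value: no separator at all
}
```

The three lines printed are `usr local bin', `usr/local/bin', and `localbin'.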
It would be nice if awk had an assignment operator for concatenation. The lack of an explicit operator for concatenation makes string operations more difficult than they really need to be.
The systime function built in to gawk returns the current time of day as a timestamp in "seconds since the Epoch." This timestamp can be converted into a printable date of almost infinitely variable format using the built-in strftime function. (For more information on systime and strftime, see section Functions for Dealing with Time Stamps.)
An interesting but difficult problem is to convert a readable representation of a date back into a timestamp. The ANSI C library provides a mktime function that does the basic job, converting a canonical representation of a date into a timestamp.
It would appear at first glance that gawk would have to supply a mktime built-in function that was simply a "hook" to the C language version. In fact though, mktime can be implemented entirely in awk.
Here is a version of mktime for awk. It takes a simple representation of the date and time, and converts it into a timestamp.
The code is presented here intermixed with explanatory prose. In section Extracting Programs from Texinfo Source Files, you will see how the Texinfo source file for this book can be processed to extract the code into a single source file.
The program begins with a descriptive comment and a BEGIN rule that initializes a table _tm_months. This table is a two-dimensional array that has the lengths of the months. The first index is zero for regular years, and one for leap years. The values are the same for all the months in both kinds of years, except for February; thus the use of multiple assignment.
    # mktime.awk -- convert a canonical date representation
    #                into a timestamp
    # Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
    # May 1993

    BEGIN    \
    {
        # Initialize table of month lengths
        _tm_months[0,1] = _tm_months[1,1] = 31
        _tm_months[0,2] = 28; _tm_months[1,2] = 29
        _tm_months[0,3] = _tm_months[1,3] = 31
        _tm_months[0,4] = _tm_months[1,4] = 30
        _tm_months[0,5] = _tm_months[1,5] = 31
        _tm_months[0,6] = _tm_months[1,6] = 30
        _tm_months[0,7] = _tm_months[1,7] = 31
        _tm_months[0,8] = _tm_months[1,8] = 31
        _tm_months[0,9] = _tm_months[1,9] = 30
        _tm_months[0,10] = _tm_months[1,10] = 31
        _tm_months[0,11] = _tm_months[1,11] = 30
        _tm_months[0,12] = _tm_months[1,12] = 31
    }
The benefit of merging multiple BEGIN rules (see section The BEGIN and END Special Patterns) is particularly clear when writing library files. Functions in library files can cleanly initialize their own private data and also provide clean-up actions in private END rules.
The next function is a simple one that computes whether a given year is or is not a leap year. If a year is evenly divisible by four, but not evenly divisible by 100, or if it is evenly divisible by 400, then it is a leap year. Thus, 1904 was a leap year, 1900 was not, but 2000 will be.
    # decide if a year is a leap year
    function _tm_isleap(year,    ret)
    {
        ret = (year % 4 == 0 && year % 100 != 0) ||
                (year % 400 == 0)

        return ret
    }
This function is only used a few times in this file, and its computation could have been written in-line (at the point where it's used). Making it a separate function made the original development easier, and also avoids the possibility of typing errors when duplicating the code in multiple places.
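To see the rule at work on the years mentioned above, the function can be called from a small BEGIN rule. The function is repeated here so that the example stands alone; the list of years (with 1993 added as an ordinary year) is just sample data.

```awk
# decide if a year is a leap year (same test as above)
function _tm_isleap(year,    ret)
{
    ret = (year % 4 == 0 && year % 100 != 0) ||
            (year % 400 == 0)

    return ret
}

BEGIN {
    n = split("1900 1904 1993 2000", years, " ")
    for (i = 1; i <= n; i++)
        printf "%d %s\n", years[i],
               (_tm_isleap(years[i]) ? "leap" : "regular")
}
```

This reports 1904 and 2000 as leap years, and 1900 and 1993 as regular years.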
The next function is more interesting. It does most of the work of generating a timestamp, which is converting a date and time into some number of seconds since the Epoch. The caller passes an array (rather imaginatively named a) containing six values: the year including century, the month as a number between one and 12, the day of the month, the hour as a number between zero and 23, the minute in the hour, and the seconds within the minute.
The function uses several local variables to precompute the number of seconds in an hour, seconds in a day, and seconds in a year. Often, similar C code simply writes out the expression in-line, expecting the compiler to do constant folding. E.g., most C compilers would turn `60 * 60' into `3600' at compile time, instead of recomputing it every time at run time. Precomputing these values makes the function more efficient.
    # convert a date into seconds
    function _tm_addup(a,    total, yearsecs, daysecs,
                             hoursecs, i, j)
    {
        hoursecs = 60 * 60
        daysecs = 24 * hoursecs
        yearsecs = 365 * daysecs

        total = (a[1] - 1970) * yearsecs

        # extra day for leap years
        for (i = 1970; i < a[1]; i++)
            if (_tm_isleap(i))
                total += daysecs

        j = _tm_isleap(a[1])
        for (i = 1; i < a[2]; i++)
            total += _tm_months[j, i] * daysecs

        total += (a[3] - 1) * daysecs
        total += a[4] * hoursecs
        total += a[5] * 60
        total += a[6]

        return total
    }
The function starts with a first approximation of all the seconds between Midnight, January 1, 1970,(18) and the beginning of the current year. It then goes through all those years, and for every leap year, adds an additional day's worth of seconds.
The variable j holds one if the current year is a leap year, and zero if it is not. For every month in the current year prior to the current month, it adds the number of seconds in the month, using the appropriate entry in the _tm_months array.
Finally, it adds in the seconds for the number of days prior to the current day, and the number of hours, minutes, and seconds in the current day.
The result is a count of seconds since January 1, 1970. This value is not yet what is needed though. The reason why is described shortly.
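The same adding-up can be written more compactly by counting whole days first. The sketch below is an alternative formulation for checking the arithmetic, not the code from `mktime.awk'; the names is_leap and utc_seconds are hypothetical, and the month lengths come from a split string instead of the _tm_months table.

```awk
# Alternative formulation of the adding-up step, for checking by hand.
function is_leap(y)
{
    return (y % 4 == 0 && y % 100 != 0) || (y % 400 == 0)
}

function utc_seconds(y, mo, d, h, mi, s,    mlen, days, i)
{
    split("31 28 31 30 31 30 31 31 30 31 30 31", mlen, " ")
    if (is_leap(y))
        mlen[2] = 29

    days = 0
    for (i = 1970; i < y; i++)      # whole years since 1970
        days += 365 + is_leap(i)
    for (i = 1; i < mo; i++)        # whole months in the current year
        days += mlen[i]
    days += d - 1                   # whole days in the current month

    return ((days * 24 + h) * 60 + mi) * 60 + s
}

BEGIN {
    printf "%d\n", utc_seconds(1993, 5, 23, 15, 35, 10)
}
```

For 3:35:10 p.m. on May 23, 1993, there are 8401 days up to the start of 1993 (23 years plus six leap days), 120 days for January through April, and 22 full days in May, so 8543 days in all. The program prints 738171310.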
The main mktime function takes a single character string argument. This string is a representation of a date and time in a "canonical" (fixed) form. This string should be "year month day hour minute second".
    # mktime -- convert a date into seconds,
    #           compensate for time zone

    function mktime(str,    res1, res2, a, b, i, j, t, diff)
    {
        i = split(str, a, " ")    # don't rely on FS

        if (i != 6)
            return -1

        # force numeric
        for (j in a)
            a[j] += 0

        # validate
        if (a[1] < 1970 ||
            a[2] < 1 || a[2] > 12 ||
            a[3] < 1 || a[3] > 31 ||
            a[4] < 0 || a[4] > 23 ||
            a[5] < 0 || a[5] > 59 ||
            a[6] < 0 || a[6] > 61 )
                return -1

        res1 = _tm_addup(a)
        t = strftime("%Y %m %d %H %M %S", res1)

        if (_tm_debug)
            printf("(%s) -> (%s)\n", str, t) > "/dev/stderr"

        split(t, b, " ")
        res2 = _tm_addup(b)

        diff = res1 - res2

        if (_tm_debug)
            printf("diff = %d seconds\n", diff) > "/dev/stderr"

        res1 += diff

        return res1
    }
The function first splits the string into an array, using spaces and tabs as separators. If there are not six elements in the array, it returns an error, signaled as the value -1. Next, it forces each element of the array to be numeric, by adding zero to it. The following `if' statement then makes sure that each element is within an allowable range. (This checking could be extended further, e.g., to make sure that the day of the month is within the correct range for the particular month supplied.) All of this is essentially preliminary set-up and error checking.
Recall that _tm_addup generated a value in seconds since Midnight, January 1, 1970. This value is not directly usable as the result we want, since the calculation does not account for the local timezone. In other words, the value represents the count in seconds since the Epoch, but only for UTC (Coordinated Universal Time). If the local timezone is east or west of UTC, then some number of hours should be either added to, or subtracted from the resulting timestamp.
For example, Atlanta, Georgia (USA), is normally five hours west of (behind) UTC. It is only four hours behind UTC if daylight saving time is in effect. If you are calling mktime in Atlanta, with the argument "1993 5 23 18 23 12", the result from _tm_addup will be for 6:23 p.m. UTC, which is only 2:23 p.m. in Atlanta. It is necessary to add another four hours worth of seconds to the result.
How can mktime determine how far away it is from UTC? This is surprisingly easy. The returned timestamp represents the time passed to mktime as UTC. This timestamp can be fed back to strftime, which will format it as a local time; i.e., as if it already had the UTC difference added in to it. This is done by giving "%Y %m %d %H %M %S" to strftime as the format argument. It returns the computed timestamp in the original string format. The result represents a time that accounts for the UTC difference. When the new time is converted back to a timestamp, the difference between the two timestamps is the difference (in seconds) between the local timezone and UTC. This difference is then added back to the original result. An example demonstrating this is presented below.
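The arithmetic can be followed with fixed numbers. The sketch below does not call strftime; it assumes a local zone four hours behind UTC (the hypothetical variable offset stands in for what strftime would do) and starts from 738171310, the count obtained by adding up "1993 5 23 15 35 10" as if it were UTC.

```awk
# Worked arithmetic for the timezone compensation, with an assumed
# fixed offset in place of the real strftime round trip.
BEGIN {
    offset = -4 * 60 * 60   # assumption: local time is 4 hours behind UTC

    res1 = 738171310        # "1993 5 23 15 35 10" added up as if UTC
    res2 = res1 + offset    # the local rendering of res1, added up again
    diff = res1 - res2      # how far the local zone is from UTC
    result = res1 + diff    # timestamp for 15:35:10 local time

    printf "diff = %d seconds, result = %d\n", diff, result
}
```

This prints `diff = 14400 seconds, result = 738185710'. The result is 7:35:10 p.m. UTC, which is 3:35:10 p.m. in a zone four hours behind UTC.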
Finally, there is a "main" program for testing the function.
    BEGIN  {
        if (_tm_test) {
            printf "Enter date as yyyy mm dd hh mm ss: "
            getline _tm_test_date

            t = mktime(_tm_test_date)
            r = strftime("%Y %m %d %H %M %S", t)
            printf "Got back (%s)\n", r
        }
    }
The entire program uses two variables that can be set on the command line to control debugging output and to enable the test in the final BEGIN rule. Here is the result of a test run. (Note that debugging output is to standard error, and test output is to standard output.)
    $ gawk -f mktime.awk -v _tm_test=1 -v _tm_debug=1
    -| Enter date as yyyy mm dd hh mm ss: 1993 5 23 15 35 10
    error--> (1993 5 23 15 35 10) -> (1993 05 23 11 35 10)
    error--> diff = 14400 seconds
    -| Got back (1993 05 23 15 35 10)
The time entered was 3:35 p.m. (15:35 on a 24-hour clock), on May 23, 1993. The first line of debugging output shows what strftime produces when the entered time is treated as UTC: the local time 11:35, four hours earlier, since UTC is four hours ahead of the local time zone. The second line shows that the difference is 14400 seconds, which is four hours. (The difference is only four hours, since daylight saving time is in effect during May.) The final line of test output shows that the timezone compensation algorithm works; the returned time is the same as the entered time.
This program does not solve the general problem of turning an arbitrary date representation into a timestamp. That problem is very involved. However, the mktime function provides a foundation upon which to build. Other software can convert month names into numeric months, and AM/PM times into 24-hour clocks, to generate the "canonical" format that mktime requires.
The systime and strftime functions described in section Functions for Dealing with Time Stamps, provide the minimum functionality necessary for dealing with the time of day in human readable form. While strftime is extensive, the control formats are not necessarily easy to remember or intuitively obvious when reading a program.
The following function, gettimeofday, populates a user-supplied array with pre-formatted time information. It returns a string with the current time formatted in the same way as the date utility.
    # gettimeofday -- get the time of day in a usable format
    # Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain, May 1993
    #
    # Returns a string in the format of output of date(1)
    # Populates the array argument time with individual values:
    #    time["second"]       -- seconds (0 - 59)
    #    time["minute"]       -- minutes (0 - 59)
    #    time["hour"]         -- hours (0 - 23)
    #    time["althour"]      -- hours (1 - 12)
    #    time["monthday"]     -- day of month (1 - 31)
    #    time["month"]        -- month of year (1 - 12)
    #    time["monthname"]    -- name of the month
    #    time["shortmonth"]   -- short name of the month
    #    time["year"]         -- year within century (0 - 99)
    #    time["fullyear"]     -- year with century (19xx or 20xx)
    #    time["weekday"]      -- day of week (Sunday = 0)
    #    time["altweekday"]   -- day of week (Monday = 1, Sunday = 7)
    #    time["weeknum"]      -- week number, Sunday first day
    #    time["altweeknum"]   -- week number, Monday first day
    #    time["dayname"]      -- name of weekday
    #    time["shortdayname"] -- short name of weekday
    #    time["yearday"]      -- day of year (1 - 366)
    #    time["timezone"]     -- abbreviation of timezone name
    #    time["ampm"]         -- AM or PM designation

    function gettimeofday(time,    ret, now, i)
    {
        # get time once, avoids unnecessary system calls
        now = systime()

        # return date(1)-style output
        ret = strftime("%a %b %d %H:%M:%S %Z %Y", now)

        # clear out target array
        for (i in time)
            delete time[i]

        # fill in values, force numeric values to be
        # numeric by adding 0
        time["second"]       = strftime("%S", now) + 0
        time["minute"]       = strftime("%M", now) + 0
        time["hour"]         = strftime("%H", now) + 0
        time["althour"]      = strftime("%I", now) + 0
        time["monthday"]     = strftime("%d", now) + 0
        time["month"]        = strftime("%m", now) + 0
        time["monthname"]    = strftime("%B", now)
        time["shortmonth"]   = strftime("%b", now)
        time["year"]         = strftime("%y", now) + 0
        time["fullyear"]     = strftime("%Y", now) + 0
        time["weekday"]      = strftime("%w", now) + 0
        time["altweekday"]   = strftime("%u", now) + 0
        time["dayname"]      = strftime("%A", now)
        time["shortdayname"] = strftime("%a", now)
        time["yearday"]      = strftime("%j", now) + 0
        time["timezone"]     = strftime("%Z", now)
        time["ampm"]         = strftime("%p", now)
        time["weeknum"]      = strftime("%U", now) + 0
        time["altweeknum"]   = strftime("%W", now) + 0

        return ret
    }
The string indices are easier to use and read than the various formats required by strftime. The alarm program presented in section An Alarm Clock Program, uses this function.
The gettimeofday function is presented above as it was written. A more general design for this function would have allowed the user to supply an optional timestamp value that would have been used instead of the current time.
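One way such a design might look is sketched below. The function name gettimeofday_at and its second parameter when are assumptions made for this illustration, only a few of the array elements are filled in to keep it short, and, like the original, it relies on gawk's systime and strftime.

```awk
# gettimeofday_at -- sketch of a more general design (hypothetical name):
# use the timestamp in "when" if one is supplied, else the current time.
function gettimeofday_at(time, when,    i)
{
    if (when == "")             # no timestamp supplied by the caller
        when = systime()

    # clear out target array
    for (i in time)
        delete time[i]

    # a few representative elements; the rest follow the same pattern
    time["fullyear"]  = strftime("%Y", when) + 0
    time["month"]     = strftime("%m", when) + 0
    time["monthname"] = strftime("%B", when)

    # return date(1)-style output, as gettimeofday does
    return strftime("%a %b %d %H:%M:%S %Z %Y", when)
}

BEGIN {
    gettimeofday_at(t, 738171310)   # a moment on May 23, 1993 (UTC)
    print t["fullyear"], t["month"]
}
```

Called with one argument, the function behaves like gettimeofday and uses the current time. The comparison with the empty string means that a timestamp of zero is still accepted as a real value.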
The BEGIN and END rules are each executed exactly once, at the beginning and end respectively of your awk program (see section The BEGIN and END Special Patterns). We (the gawk authors) once had a user who mistakenly thought that the BEGIN rule was executed at the beginning of each data file and the END rule was executed at the end of each data file. When informed that this was not the case, the user requested that we add new special patterns to gawk, named BEGIN_FILE and END_FILE, that would have the desired behavior. He even supplied us the code to do so.
However, after a little thought, I came up with the following library program. It arranges to call two user-supplied functions, beginfile and endfile, at the beginning and end of each data file. Besides solving the problem in only nine(!) lines of code, it does so portably; this will work with any implementation of awk.
    # transfile.awk
    #
    # Give the user a hook for filename transitions
    #
    # The user must supply functions beginfile() and endfile()
    # that each take the name of the file being started or
    # finished, respectively.
    #
    # Arnold Robbins, arnold@gnu.ai.mit.edu, January 1992
    # Public Domain

    FILENAME != _oldfilename \
    {
        if (_oldfilename != "")
            endfile(_oldfilename)
        _oldfilename = FILENAME
        beginfile(FILENAME)
    }

    END   { endfile(FILENAME) }
This file must be loaded before the user's "main" program, so that the rule it supplies will be executed first.
This rule relies on awk's FILENAME variable that automatically changes for each new data file. The current file name is saved in a private variable, _oldfilename. If FILENAME does not equal _oldfilename, then a new data file is being processed, and it is necessary to call endfile for the old file. Since endfile should only be called if a file has been processed, the program first checks to make sure that _oldfilename is not the null string. The program then assigns the current file name to _oldfilename, and calls beginfile for the file. Since, like all awk variables, _oldfilename will be initialized to the null string, this rule executes correctly even for the first data file.
The program also supplies an END rule, to do the final processing for the last file. Since this END rule comes before any END rules supplied in the "main" program, endfile will be called first. Once again the value of multiple BEGIN and END rules should be clear.
This version has the same problem as the first version of nextfile (see section Implementing nextfile as a Function). If the same data file occurs twice in a row on the command line, then endfile and beginfile will not be executed at the end of the first pass and at the beginning of the second pass. This version solves the problem.
    # ftrans.awk -- handle data file transitions
    #
    # user supplies beginfile() and endfile() functions
    #
    # Arnold Robbins, arnold@gnu.ai.mit.edu. November 1992
    # Public Domain

    FNR == 1 {
        if (_filename_ != "")
            endfile(_filename_)
        _filename_ = FILENAME
        beginfile(FILENAME)
    }

    END  { endfile(_filename_) }
In section Counting Things, you will see how this library function can be used, and how it simplifies writing the main program.
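In the meantime, here is a small sketch of what the user's side might look like. The two rules from `ftrans.awk' are repeated so that the file stands alone; the bodies of beginfile and endfile, and the variables _records and _total, are made up for this illustration and simply count the records in each data file.

```awk
# The ftrans.awk rules from above, repeated so this file stands alone.
FNR == 1 {
    if (_filename_ != "")
        endfile(_filename_)
    _filename_ = FILENAME
    beginfile(FILENAME)
}

END   { endfile(_filename_) }

# Hypothetical user-supplied hooks: count the records in each data file.
function beginfile(name)
{
    _records = 0
}

function endfile(name)
{
    _total[name] = _records
    printf "%s: %d records\n", name, _records
}

# The "main" program only has to count; the hooks do the bookkeeping.
{ _records++ }
```

If this is saved in a file (say `count.awk', a name chosen here only as an example) and run as `awk -f count.awk data1 data2', it prints one line for each data file as that file is finished.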
Most utilities on POSIX compatible systems take options or "switches" on the command line that can be used to change the way a program behaves. awk is an example of such a program (see section Command Line Options). Often, options take arguments, data that the program needs to correctly obey the command line option. For example, awk's `-F' option requires a string to use as the field separator. The first occurrence on the command line of either `--' or a string that does not begin with `-' ends the options.
Most Unix systems provide a C function named getopt for processing command line arguments. The programmer provides a string describing the one letter options. If an option requires an argument, it is followed in the string with a colon. getopt is also passed the count and values of the command line arguments, and is called in a loop. getopt processes the command line arguments for option letters. Each time around the loop, it returns a single character representing the next option letter that it found, or `?' if it found an invalid option. When it returns -1, there are no options left on the command line.
When using getopt, options that do not take arguments can be grouped together. Furthermore, options that take arguments require that the argument be present. The argument can immediately follow the option letter, or it can be a separate command line argument.
Given a hypothetical program that takes three command line options, `-a', `-b', and `-c', and `-b' requires an argument, all of the following are valid ways of invoking the program:
    prog -a -b foo -c data1 data2 data3
    prog -ac -bfoo -- data1 data2 data3
    prog -acbfoo data1 data2 data3
Notice that when the argument is grouped with its option, the rest of the command line argument is considered to be the option's argument. In the above example, `-acbfoo' indicates that all of the `-a', `-b', and `-c' options were supplied, and that `foo' is the argument to the `-b' option.
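The way a grouped argument is taken apart can be sketched in a few lines of awk. This is not the getopt function itself, only an illustration of the grouping rule; the function name explain is hypothetical, and the sketch does not handle an option whose argument is a separate command line argument.

```awk
# Sketch only: take one grouped argument such as "-acbfoo" apart by hand.
# opts uses getopt's notation: a letter followed by ":" takes an argument.
function explain(arg, opts,    i, c, out)
{
    for (i = 2; i <= length(arg); i++) {
        c = substr(arg, i, 1)               # one option letter at a time
        out = out (out == "" ? "" : " ") "-" c
        if (index(opts, c ":") > 0) {       # this option takes an argument:
            out = out "=" substr(arg, i + 1)    # the rest of the word is it
            break
        }
    }
    return out
}

BEGIN {
    print explain("-acbfoo", "ab:c")
}
```

This prints `-a -c -b=foo': `a' and `c' are plain options, and everything after the `b' is the argument to `-b'.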
getopt provides four external variables that the programmer can use.
optind
    The index in the argument value array (argv) where the first non-option command line argument can be found.
optarg
    The string value of the argument to an option.
opterr
    getopt prints an error message when it finds an invalid option. Setting opterr to zero disables this feature. (An application might wish to print its own error message.)
optopt
    The letter representing the current command line option.
The following C fragment shows how getopt might process command line arguments for awk.
    int
    main(int argc, char *argv[])
    {
        ...
        /* print our own message */
        opterr = 0;
        while ((c = getopt(argc, argv, "v:f:F:W:")) != -1) {
            switch (c) {
            case 'f':    /* file */
                ...
                break;
            case 'F':    /* field separator */
                ...
                break;
            case 'v':    /* variable assignment */
                ...
                break;
            case 'W':    /* extension */
                ...
                break;
            case '?':
            default:
                usage();
                break;
            }
        }
        ...
    }
As a side point, gawk actually uses the GNU getopt_long function to process both normal and GNU-style long options (see section Command Line Options).
The abstraction provided by getopt is very useful, and would be quite handy in awk programs as well. Here is an awk version of getopt. This function highlights one of the greatest weaknesses in awk, which is that it is very poor at manipulating single characters. Repeated calls to substr are necessary for accessing individual characters (see section Built-in Functions for String Manipulation). The discussion walks through the code a bit at a time.
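As a small aside on the substr point, here is a minimal sketch that breaks a string into single characters, one substr call each. It is not part of getopt.awk; the function name chars and the array name letters are made up for this illustration.

```awk
# chars -- split string s into single characters stored in out[1..n]
# (hypothetical helper, not part of getopt.awk)
function chars(s, out,    i, n)
{
    n = length(s)
    for (i = 1; i <= n; i++)
        out[i] = substr(s, i, 1)    # one substr call per character
    return n
}

BEGIN {
    n = chars("-abx", letters)
    for (i = 1; i <= n; i++)
        printf("letters[%d] = <%s>\n", i, letters[i])
}
```

getopt does not build such an array; it keeps a position (_opti) and calls substr once per call, but the underlying idiom is the same.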
    # getopt -- do C library getopt(3) function in awk
    #
    # arnold@gnu.ai.mit.edu
    # Public domain
    #
    # Initial version: March, 1991
    # Revised: May, 1993

    # External variables:
    #    Optind -- index of ARGV for first non-option argument
    #    Optarg -- string value of argument to current option
    #    Opterr -- if non-zero, print our own diagnostic
    #    Optopt -- current option letter

    # Returns
    #    -1     at end of options
    #    ?      for unrecognized option
    #    <c>    a character representing the current option

    # Private Data
    #    _opti  index in multi-flag option, e.g., -abc
The function starts out with some documentation: who wrote the code, and when it was revised, followed by a list of the global variables it uses, what the return values are and what they mean, and any global variables that are "private" to this library function. Such documentation is essential for any program, and particularly for library functions.
    function getopt(argc, argv, options,    optl, thisopt, i)
    {
        optl = length(options)
        if (optl == 0)        # no options given
            return -1

        if (argv[Optind] == "--") {  # all done
            Optind++
            _opti = 0
            return -1
        } else if (argv[Optind] !~ /^-[^: \t\n\f\r\v\b]/) {
            _opti = 0
            return -1
        }
The function first checks that it was indeed called with a string of options (the options parameter). If options has a zero length, getopt immediately returns -1.
The next thing to check for is the end of the options. A `--' ends the command line options, as does any command line argument that does not begin with a `-'. Optind is used to step through the array of command line arguments; it retains its value across calls to getopt, since it is a global variable.
The regexp used, /^-[^: \t\n\f\r\v\b]/, is perhaps a bit of overkill; it checks for a `-' followed by anything that is not whitespace and not a colon. If the current command line argument does not match this pattern, it is not an option, and it ends option processing.
        if (_opti == 0)
            _opti = 2
        thisopt = substr(argv[Optind], _opti, 1)
        Optopt = thisopt
        i = index(options, thisopt)
        if (i == 0) {
            if (Opterr)
                printf("%c -- invalid option\n",
                                      thisopt) > "/dev/stderr"
            if (_opti >= length(argv[Optind])) {
                Optind++
                _opti = 0
            } else
                _opti++
            return "?"
        }
The _opti variable tracks the position in the current command line argument (argv[Optind]). In the case that multiple options were grouped together with one `-' (e.g., `-abx'), it is necessary to return them to the user one at a time.
If _opti is equal to zero, it is set to two, the index in the string of the next character to look at (we skip the `-', which is at position one). The variable thisopt holds the character, obtained with substr. It is saved in Optopt for the main program to use.
If thisopt is not in the options string, then it is an invalid option. If Opterr is non-zero, getopt prints an error message on the standard error that is similar to the message from the C version of getopt.
Since the option is invalid, it is necessary to skip it and move on to the next option character. If _opti is greater than or equal to the length of the current command line argument, then it is necessary to move on to the next one, so Optind is incremented and _opti is reset to zero. Otherwise, Optind is left alone and _opti is merely incremented.
In any case, since the option was invalid, getopt returns `?'. The main program can examine Optopt if it needs to know what the invalid option letter actually was.
        if (substr(options, i + 1, 1) == ":") {
            # get option argument
            if (length(substr(argv[Optind], _opti + 1)) > 0)
                Optarg = substr(argv[Optind], _opti + 1)
            else
                Optarg = argv[++Optind]
            _opti = 0
        } else
            Optarg = ""
If the option requires an argument, the option letter is followed by a colon in the options string. If there are remaining characters in the current command line argument (argv[Optind]), then the rest of that string is assigned to Optarg. Otherwise, the next command line argument is used (`-xFOO' vs. `-x FOO'). In either case, _opti is reset to zero, since there are no more characters left to examine in the current command line argument.
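The two argument forms can be seen in isolation with a minimal sketch like the following. The helper option_arg is hypothetical and exists only for this illustration; the real getopt updates Optarg and Optind directly instead of returning the argument.

```awk
# option_arg -- return the argument for the option letter found at
# position pos of arg; fall back to next_arg when nothing follows it.
# (hypothetical helper; getopt itself sets Optarg and Optind instead)
function option_arg(arg, pos, next_arg,    rest)
{
    rest = substr(arg, pos + 1)
    if (length(rest) > 0)
        return rest         # `-xFOO' form: argument is attached
    return next_arg         # `-x FOO' form: argument is separate
}

BEGIN {
    print option_arg("-xFOO", 2, "data1")
    print option_arg("-x", 2, "FOO")
}
```

Both calls yield the same argument string, which is why a main program does not need to care how the user typed the option.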
        if (_opti == 0 || _opti >= length(argv[Optind])) {
            Optind++
            _opti = 0
        } else
            _opti++
        return thisopt
    }
Finally, if _opti is either zero or greater than or equal to the length of the current command line argument, it means this element in argv is through being processed, so Optind is incremented to point to the next element in argv. If neither condition is true, then only _opti is incremented, so that the next option letter can be processed on the next call to getopt.
    BEGIN {
        Opterr = 1    # default is to diagnose
        Optind = 1    # skip ARGV[0]

        # test program
        if (_getopt_test) {
            while ((_go_c = getopt(ARGC, ARGV, "ab:cd")) != -1)
                printf("c = <%c>, optarg = <%s>\n",
                                           _go_c, Optarg)
            printf("non-option arguments:\n")
            for (; Optind < ARGC; Optind++)
                printf("\tARGV[%d] = <%s>\n",
                                        Optind, ARGV[Optind])
        }
    }
The BEGIN rule initializes both Opterr and Optind to one. Opterr is set to one, since the default behavior is for getopt to print a diagnostic message upon seeing an invalid option. Optind is set to one, since there's no reason to look at the program name, which is in ARGV[0].
The rest of the BEGIN rule is a simple test program. Here is the result of two sample runs of the test program.
    $ awk -f getopt.awk -v _getopt_test=1 -- -a -cbARG bax -x
    -| c = <a>, optarg = <>
    -| c = <c>, optarg = <>
    -| c = <b>, optarg = <ARG>
    -| non-option arguments:
    -|         ARGV[3] = <bax>
    -|         ARGV[4] = <-x>

    $ awk -f getopt.awk -v _getopt_test=1 -- -a -x -- xyz abc
    -| c = <a>, optarg = <>
    error--> x -- invalid option
    -| c = <?>, optarg = <>
    -| non-option arguments:
    -|         ARGV[4] = <xyz>
    -|         ARGV[5] = <abc>
The first `--' terminates the arguments to awk, so that it does not try to interpret the `-a' etc. as its own options.
Several of the sample programs presented in section Practical awk Programs, use getopt to process their arguments.
The `/dev/user' special file (see section Special File Names in gawk) provides access to the current user's real and effective user and group id numbers, and if available, the user's supplementary group set. However, since these are numbers, they do not provide very useful information to the average user. There needs to be some way to find the user information associated with the user and group numbers. This section presents a suite of functions for retrieving information from the user database. See section Reading the Group Database, for a similar suite that retrieves information from the group database.
The POSIX standard does not define the file where user information is kept. Instead, it provides the <pwd.h> header file and several C language subroutines for obtaining user information. The primary function is getpwent, for "get password entry." The "password" comes from the original user database file, `/etc/passwd', which kept user information, along with the encrypted passwords (hence the name).
While an awk program could simply read `/etc/passwd' directly (the format is well known), because of the way password files are handled on networked systems, this file may not contain complete information about the system's set of users. To be sure of being able to produce a readable, complete version of the user database, it is necessary to write a small C program that calls getpwent. getpwent is defined to return a pointer to a struct passwd. Each time it is called, it returns the next entry in the database. When there are no more entries, it returns NULL, the null pointer. When this happens, the C program should call endpwent to close the database.
Here is pwcat, a C program that "cats" the password database.
    /*
     * pwcat.c
     *
     * Generate a printable version of the password database
     *
     * Arnold Robbins
     * arnold@gnu.ai.mit.edu
     * May 1993
     * Public Domain
     */

    #include <stdio.h>
    #include <stdlib.h>    /* for exit() */
    #include <pwd.h>

    int
    main(int argc, char **argv)
    {
        struct passwd *p;

        /* print one colon-separated line per database entry */
        while ((p = getpwent()) != NULL)
            printf("%s:%s:%ld:%ld:%s:%s:%s\n",
                p->pw_name, p->pw_passwd, (long) p->pw_uid,
                (long) p->pw_gid, p->pw_gecos, p->pw_dir, p->pw_shell);

        endpwent();
        exit(0);
    }
If you don't understand C, don't worry about it.
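The output from pwcat is the user database, in the traditional `/etc/passwd' format of colon-separated fields. The fields are:
    User name
    Encrypted password
    User-ID number
    Group-ID number
    Full name of the user
    Home directory (familiar to shell programmers as $HOME)
    Login shell
Here are a few lines representative of pwcat's output.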
    $ pwcat
    -| root:3Ov02d5VaUPB6:0:1:Operator:/:/bin/sh
    -| nobody:*:65534:65534::/:
    -| daemon:*:1:1::/:
    -| sys:*:2:2::/:/bin/csh
    -| bin:*:3:3::/bin:
    -| arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
    -| miriam:yxaay:112:10:Miriam Robbins:/home/miriam:/bin/sh
    ...
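Since the library functions in this section return whole lines in this format, a caller usually has to split them again. The following is a minimal sketch of one way to do that. The helper pw_fields and its key names ("name", "uid", and so on) are assumptions made for this example; they are not part of passwd.awk.

```awk
# pw_fields -- split one pwcat-style line into a named array
# (hypothetical helper; the key names are chosen for this example only)
function pw_fields(line, rec,    f, n)
{
    n = split(line, f, ":")
    if (n < 7)
        return 0            # not a complete seven-field entry
    rec["name"]   = f[1]
    rec["passwd"] = f[2]
    rec["uid"]    = f[3]
    rec["gid"]    = f[4]
    rec["gecos"]  = f[5]    # full name
    rec["dir"]    = f[6]    # home directory
    rec["shell"]  = f[7]
    return 1
}

BEGIN {
    line = "arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh"
    if (pw_fields(line, user))
        printf("%s has uid %d and home %s\n",
               user["name"], user["uid"], user["dir"])
}
```

Using split on a copy of the line, rather than assigning to $0, leaves the caller's current record and FS untouched.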
With that introduction, here is a group of functions for getting user information. There are several functions here, corresponding to the C functions of the same name.
    # passwd.awk -- access password file information
    # Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
    # May 1993

    BEGIN {
        # tailor this to suit your system
        _pw_awklib = "/usr/local/libexec/awk/"
    }

    function _pw_init(    oldfs, oldrs, olddol0, pwcat)
    {
        if (_pw_inited)
            return
        oldfs = FS
        oldrs = RS
        olddol0 = $0
        FS = ":"
        RS = "\n"
        pwcat = _pw_awklib "pwcat"
        while ((pwcat | getline) > 0) {
            _pw_byname[$1] = $0
            _pw_byuid[$3] = $0
            _pw_bycount[++_pw_total] = $0
        }
        close(pwcat)
        _pw_count = 0
        _pw_inited = 1
        FS = oldfs
        RS = oldrs
        $0 = olddol0
    }
The BEGIN rule sets a private variable to the directory where pwcat is stored. Since it is used to help out an awk library routine, we have chosen to put it in `/usr/local/libexec/awk'. You might want it to be in a different directory on your system.
The function _pw_init keeps three copies of the user information in three associative arrays. The arrays are indexed by user name (_pw_byname), by user-id number (_pw_byuid), and by order of occurrence (_pw_bycount).
The variable _pw_inited is used for efficiency; _pw_init only needs to be called once.
Since this function uses getline to read information from pwcat, it first saves the values of FS, RS, and $0. Doing so is necessary, since these functions could be called from anywhere within a user's program, and the user may have his or her own values for FS and RS.
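The save-and-restore pattern can be tried on its own with a minimal sketch such as this one. The function first_fields and the echo command are stand-ins chosen for the example (it assumes a Unix shell with echo is available); only the handling of FS, RS, and $0 mirrors _pw_init.

```awk
# first_fields -- run cmd, collect the first colon-separated field of
# each output line into out[], and leave FS, RS, and $0 as they were.
# (hypothetical helper showing the save/restore pattern only)
function first_fields(cmd, out,    oldfs, oldrs, olddol0, n)
{
    oldfs = FS
    oldrs = RS
    olddol0 = $0
    FS = ":"
    RS = "\n"
    n = 0
    while ((cmd | getline) > 0)    # overwrites $0 on every line read
        out[++n] = $1
    close(cmd)
    FS = oldfs
    RS = oldrs
    $0 = olddol0                   # re-split with the caller's FS
    return n
}

BEGIN {
    FS = ","
    $0 = "keep,these,fields"
    n = first_fields("echo root:x:0", names)
    printf("read %d name(s), first is <%s>\n", n, names[1])
    printf("FS = <%s>, $2 = <%s>\n", FS, $2)
}
```

Restoring $0 after FS matters: assigning to $0 splits the record again, so it should happen once the caller's field separator is back in place.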
The main part of the function uses a loop to read database lines, split the line into fields, and then store the line into each array as necessary. When the loop is done, _pw_init cleans up by closing the pipeline, setting _pw_inited to one, and restoring FS, RS, and $0. The use of _pw_count will be explained below.
    function getpwnam(name)
    {
        _pw_init()
        if (name in _pw_byname)
            return _pw_byname[name]
        return ""
    }
The getpwnam function takes a user name as a string argument. If that user is in the database, it returns the appropriate line. Otherwise it returns the null string.
    function getpwuid(uid)
    {
        _pw_init()
        if (uid in _pw_byuid)
            return _pw_byuid[uid]
        return ""
    }
Similarly, the getpwuid function takes a user-id number argument. If that user number is in the database, it returns the appropriate line. Otherwise it returns the null string.
    function getpwent()
    {
        _pw_init()
        if (_pw_count < _pw_total)
            return _pw_bycount[++_pw_count]
        return ""
    }
The getpwent function simply steps through the database, one entry at a time. It uses _pw_count to track its current position in the _pw_bycount array.
    function endpwent()
    {
        _pw_count = 0
    }
The endpwent function resets _pw_count to zero, so that subsequent calls to getpwent will start over again.
A conscious design decision in this suite is that each subroutine calls _pw_init to initialize the database arrays. The overhead of running a separate process to generate the user database, and the I/O to scan it, will only be incurred if the user's main program actually calls one of these functions. If this library file is loaded along with a user's program, but none of the routines are ever called, then there is no extra run-time overhead. (The alternative would be to move the body of _pw_init into a BEGIN rule, which would always run pwcat. This simplifies the code but runs an extra process that may never be needed.)
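The effect of this design can be seen in a stripped-down sketch. All of the names here (_demo_init, demo_lookup, _demo_loads, _demo_table) are hypothetical, and filling a two-element table stands in for the expensive step of running pwcat.

```awk
# Sketch of the lazy-initialization pattern. All names here
# (_demo_init, demo_lookup, _demo_loads) are hypothetical.
function _demo_init()
{
    if (_demo_inited)
        return              # already loaded; nothing to do
    _demo_loads++           # counts how often the expensive step runs
    _demo_table["one"] = 1
    _demo_table["two"] = 2
    _demo_inited = 1
}

function demo_lookup(key)
{
    _demo_init()            # every public function starts this way
    if (key in _demo_table)
        return _demo_table[key]
    return ""
}

BEGIN {
    demo_lookup("one")
    demo_lookup("two")
    printf("table loaded %d time(s)\n", _demo_loads)
}
```

If no public function is ever called, _demo_init never runs at all, which is the saving described above.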
In turn, calling _pw_init is not too expensive, since the _pw_inited variable keeps the program from reading the data more than once. If you are worried about squeezing every last cycle out of your awk program, the check of _pw_inited could be moved out of _pw_init and duplicated in all the other functions. In practice, this is not necessary, since most awk programs are I/O bound, and it would clutter up the code.
The id program in section Printing Out User Information, uses these functions.
Much of the discussion presented in section Reading the User Database, applies to the group database as well. Although there has traditionally been a well known file, `/etc/group', in a well known format, the POSIX standard only provides a set of C library routines (<grp.h> and getgrent) for accessing the information. Even though this file may exist, it likely does not have complete information. Therefore, as with the user database, it is necessary to have a small C program that generates the group database as its output.
Here is grcat, a C program that "cats" the group database.
    /*
     * grcat.c
     *
     * Generate a printable version of the group database
     *
     * Arnold Robbins, arnold@gnu.ai.mit.edu
     * May 1993
     * Public Domain
     */

    #include <stdio.h>
    #include <stdlib.h>    /* for exit() */
    #include <grp.h>

    int
    main(int argc, char **argv)
    {
        struct group *g;
        int i;

        while ((g = getgrent()) != NULL) {
            printf("%s:%s:%ld:", g->gr_name, g->gr_passwd,
                                        (long) g->gr_gid);
            /* members are printed as a comma-separated list */
            for (i = 0; g->gr_mem[i] != NULL; i++) {
                printf("%s", g->gr_mem[i]);
                if (g->gr_mem[i+1] != NULL)
                    putchar(',');
            }
            putchar('\n');
        }
        endgrent();
        exit(0);
    }
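Each line in the group database represents one group. The fields are separated with colons, and represent the following information.
    Group name
    Encrypted group password
    Group-ID number
    Group member list: a comma-separated list of user names. These users are members of the group. If your system allows users to be members of several groups simultaneously, reading `/dev/user' returns those group-id numbers in $5 through $NF. (Note that `/dev/user' is a gawk extension; see section Special File Names in gawk.)
Here is what running grcat might produce: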
    $ grcat
    -| wheel:*:0:arnold
    -| nogroup:*:65534:
    -| daemon:*:1:
    -| kmem:*:2:
    -| staff:*:10:arnold,miriam,andy
    -| other:*:20:
    ...
Here are the functions for obtaining information from the group database. There are several, modeled after the C library functions of the same names.
    # group.awk -- functions for dealing with the group file
    # Arnold Robbins, arnold@gnu.ai.mit.edu, Public Domain
    # May 1993

    BEGIN    \
    {
        # Change to suit your system
        _gr_awklib = "/usr/local/libexec/awk/"
    }

    function _gr_init(    oldfs, oldrs, olddol0, grcat, n, a, i)
    {
        if (_gr_inited)
            return

        oldfs = FS
        oldrs = RS
        olddol0 = $0
        FS = ":"
        RS = "\n"

        grcat = _gr_awklib "grcat"
        while ((grcat | getline) > 0) {
            if ($1 in _gr_byname)
                _gr_byname[$1] = _gr_byname[$1] "," $4
            else
                _gr_byname[$1] = $0
            if ($3 in _gr_bygid)
                _gr_bygid[$3] = _gr_bygid[$3] "," $4
            else
                _gr_bygid[$3] = $0

            n = split($4, a, "[ \t]*,[ \t]*")
            for (i = 1; i <= n; i++)
                if (a[i] in _gr_groupsbyuser)
                    _gr_groupsbyuser[a[i]] = \
                        _gr_groupsbyuser[a[i]] " " $1
                else
                    _gr_groupsbyuser[a[i]] = $1

            _gr_bycount[++_gr_count] = $0
        }
        close(grcat)
        _gr_count = 0
        _gr_inited++
        FS = oldfs
        RS = oldrs
        $0 = olddol0
    }
The BEGIN rule sets a private variable to the directory where grcat is stored. Since it is used to help out an awk library routine, we have chosen to put it in `/usr/local/libexec/awk'. You might want it to be in a different directory on your system.
These routines follow the same general outline as the user database routines (see section Reading the User Database). The _gr_inited variable is used to ensure that the database is scanned no more than once. The _gr_init function first saves FS, RS, and $0, and then sets FS and RS to the correct values for scanning the group information.
The group information is stored in several associative arrays. The arrays are indexed by group name (_gr_byname), by group-id number (_gr_bygid), and by position in the database (_gr_bycount). There is an additional array indexed by user name (_gr_groupsbyuser), whose values are space-separated lists of the groups that each user belongs to.
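To see how the per-user list is built up, here is a minimal sketch that applies the same split-and-append logic to two lines of sample grcat output. The function add_group_line and the array name g are assumptions for this example; _gr_init does the equivalent work inline on $1 and $4.

```awk
# add_group_line -- record, for each member named in one grcat-style
# line, the group's name in the space-separated list groups[user].
# (hypothetical helper mirroring how _gr_groupsbyuser is filled)
function add_group_line(line, groups,    f, a, n, i)
{
    if (split(line, f, ":") < 4)
        return              # no member field at all
    n = split(f[4], a, "[ \t]*,[ \t]*")
    for (i = 1; i <= n; i++)
        if (a[i] in groups)
            groups[a[i]] = groups[a[i]] " " f[1]
        else
            groups[a[i]] = f[1]
}

BEGIN {
    add_group_line("wheel:*:0:arnold", g)
    add_group_line("staff:*:10:arnold,miriam,andy", g)
    printf("arnold: %s\n", g["arnold"])
    printf("miriam: %s\n", g["miriam"])
}
```

A group with an empty member list contributes nothing, since split returns zero for an empty string.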
Unlike the user database, it is possible to have multiple records in the database for the same group. This is common when a group has a large number of members. Such a pair of entries might look like:
    tvpeople:*:101:johny,jay,arsenio
    tvpeople:*:101:david,conan,tom,joan
For this reason, _gr_init looks to see if a group name or group-id number has already been seen. If it has, then the user names are simply concatenated onto the previous list of users. (There is actually a subtle problem with the code presented above. Suppose that the first time there were no names. This code adds the names with a leading comma. It also doesn't check that there is a $4.)
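One possible way to avoid the leading comma is sketched here. The helper name _gr_append is hypothetical and is not part of group.awk; it only shows the checks that the concatenation would need.

```awk
# _gr_append -- add members to an existing group record without
# producing a stray comma. (hypothetical helper, not in group.awk)
function _gr_append(rec, members)
{
    if (members == "")
        return rec                  # nothing to add (empty or missing $4)
    if (rec ~ /:$/)
        return rec members          # earlier record had no members
    return rec "," members
}

BEGIN {
    print _gr_append("tvpeople:*:101:", "david,conan")
    print _gr_append("tvpeople:*:101:johny,jay", "david,conan")
}
```

Inside _gr_init, the assignment could then read _gr_byname[$1] = _gr_append(_gr_byname[$1], $4), and likewise for _gr_bygid.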
Finally, _gr_init closes the pipeline to grcat, restores FS, RS, and $0, initializes _gr_count to zero (it is used later), and makes _gr_inited non-zero.
    function getgrnam(group)
    {
        _gr_init()
        if (group in _gr_byname)
            return _gr_byname[group]
        return ""
    }
The getgrnam function takes a group name as its argument, and if that group exists, it is returned. Otherwise, getgrnam returns the null string.
    function getgrgid(gid)
    {
        _gr_init()
        if (gid in _gr_bygid)
            return _gr_bygid[gid]
        return ""
    }
The getgrgid function is similar; it takes a numeric group-id, and looks up the information associated with that group-id.
    function getgruser(user)
    {
        _gr_init()
        if (user in _gr_groupsbyuser)
            return _gr_groupsbyuser[user]
        return ""
    }
The getgruser function does not have a C counterpart. It takes a user name, and returns the list of groups that have the user as a member.
    function getgrent()
    {
        _gr_init()
        if (++_gr_count in _gr_bycount)
            return _gr_bycount[_gr_count]
        return ""
    }
The getgrent function steps through the database one entry at a time. It uses _gr_count to track its position in the list.
    function endgrent()
    {
        _gr_count = 0
    }
endgrent resets _gr_count to zero so that getgrent can start over again.
As with the user database routines, each function calls _gr_init to initialize the arrays. Doing so only incurs the extra overhead of running grcat if these functions are used (as opposed to moving the body of _gr_init into a BEGIN rule).
Most of the work is in scanning the database and building the various associative arrays. The functions that the user calls are themselves very simple, relying on awk's associative arrays to do work.
The id program in section Printing Out User Information, uses these functions.
Due to the way the awk language evolved, variables are either global (usable by the entire program), or local (usable just by a specific function). There is no intermediate state analogous to static variables in C.
Library functions often need to have global variables that they can use to preserve state information between calls to the function. Examples are getopt's variable _opti (see section Processing Command Line Options), and the _tm_months array used by mktime (see section Turning Dates Into Timestamps). Such variables are called private, since the only functions that need to use them are the ones in the library.
When writing a library function, you should try to choose names for your private variables so that they will not conflict with any variables used by either another library function or a user's main program. For example, a name like `i' or `j' is not a good choice, since user programs often use variable names like these for their own purposes.
The example programs shown in this chapter all start the names of their private variables with an underscore (`_'). Users generally don't use leading underscores in their variable names, so this convention immediately decreases the chances that the variable name will be accidentally shared with the user's program.
In addition, several of the library functions use a prefix that helps indicate what function or set of functions uses the variables. Examples are _tm_months in mktime (see section Turning Dates Into Timestamps), and _pw_byname in the user database routines (see section Reading the User Database). This convention is recommended, since it even further decreases the chance of inadvertent conflict among variable names. Note that this convention can be used equally well for variable names and for private function names.
While I could have re-written all the library routines to use this convention, I did not do so, in order to show how my own awk programming style has evolved, and to provide some basis for this discussion.
As a final note on variable naming, if a function makes global variables available for use by a main program, it is a good convention to start that variable's name with a capital letter. Examples are getopt's Opterr and Optind variables (see section Processing Command Line Options). The leading capital letter indicates that it is global, while the fact that the variable name is not all capital letters indicates that the variable is not one of awk's built-in variables, like FS.
It is also important that all variables in library functions that do not need to save state are in fact declared local. If this is not done, the variable could accidentally be used in the user's program, leading to bugs that are very difficult to track down.
    function lib_func(x, y,    l1, l2)
    {
        ...
        use variable some_var  # some_var could be local
        ...                    # but is not by oversight
    }
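The effect of such an oversight is easy to demonstrate with a small, made-up pair of functions. In this sketch, sum_leaky and sum_clean are hypothetical names; the only difference between them is whether i and total appear in the parameter list.

```awk
# sum_leaky forgets to declare i and total as locals, so both are global.
function sum_leaky(n)
{
    total = 0
    for (i = 1; i <= n; i++)
        total += i
    return total
}

# sum_clean declares them as extra parameters after the real one.
function sum_clean(n,    i, total)
{
    total = 0
    for (i = 1; i <= n; i++)
        total += i
    return total
}

BEGIN {
    i = 42                      # the "user's" own variable
    sum_clean(3)
    printf("after sum_clean: i = %d\n", i)
    sum_leaky(3)
    printf("after sum_leaky: i = %d\n", i)
}
```

After the call to sum_clean the main program's i is unchanged, while sum_leaky silently overwrites it with its loop counter.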
A different convention, common in the Tcl community, is to use a single associative array to hold the values needed by the library function(s), or "package." This significantly decreases the number of actual global names in use. For example, the functions described in section Reading the User Database, might have used PW_data["inited"], PW_data["total"], PW_data["count"] and PW_data["awklib"], instead of _pw_inited, _pw_awklib, _pw_total, and _pw_count.
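As a rough sketch of what this might look like, the following keeps all of a package's state in the single array PW_data. The function name pw_setup is hypothetical, and the code only initializes the bookkeeping values; it does not read the user database.

```awk
# Sketch: keep a package's private state in one global array.
# (pw_setup is a made-up name; only PW_data comes from the text)
function pw_setup()
{
    if (PW_data["inited"])
        return                  # state already set up
    PW_data["awklib"] = "/usr/local/libexec/awk/"
    PW_data["total"] = 0
    PW_data["count"] = 0
    PW_data["inited"] = 1
}

BEGIN {
    pw_setup()
    printf("helper directory: %s\n", PW_data["awklib"])
}
```

Only one global name, PW_data, can now collide with a user's variables, at the cost of slightly longer references inside the library.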
The conventions presented in this section are exactly that, conventions. You are not required to write your programs this way; we merely recommend that you do so.