As you have already seen, each awk statement consists of a pattern with an associated action. This chapter describes how you build patterns and actions.
Patterns in awk control the execution of rules: a rule is executed when its pattern matches the current input record. This section explains all about how to write patterns.
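Here is a summary of the types of patterns supported in awk.
/regular expression/
    A regular expression as a pattern. It matches when the text of the input record fits the regular expression.
expression
    A single expression. It matches when its value is non-zero (if a number) or non-null (if a string).
pat1, pat2
    A pair of patterns separated by a comma, specifying a range of records. The range includes both the initial record that matches pat1 and the final record that matches pat2.
BEGIN
END
    Special patterns for you to supply start-up or clean-up actions for your awk program. (See section The BEGIN and END Special Patterns.)
empty
    The empty pattern matches every input record.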
We have been using regular expressions as patterns since our early examples. This kind of pattern is simply a regexp constant in the pattern part of a rule. Its meaning is `$0 ~ /pattern/'. The pattern matches when the input record matches the regexp. For example:
/foo|bar|baz/  { buzzwords++ }
END            { print buzzwords, "buzzwords seen" }
Any awk expression is valid as an awk pattern. Then the pattern matches if the expression's value is non-zero (if a number) or non-null (if a string).
The expression is reevaluated each time the rule is tested against a new input record. If the expression uses fields such as $1, the value depends directly on the new input record's text; otherwise, it depends only on what has happened so far in the execution of the awk program, but that may still be useful.
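As a small sketch of an expression used as a pattern, the rule below uses the bare expression NF. The sample input and the shell function name `nonblank` are made up for illustration. NF is zero for an empty record, so the pattern fails on blank lines and matches every other record:

```shell
# Hypothetical helper: print only the non-blank lines of some sample input.
nonblank() {
    printf 'one\n\ntwo words\n\n' |
    awk 'NF'    # NF is 0 (false) on empty records, non-zero otherwise
}
nonblank
# prints:
# one
# two words
```

Because the rule has no action, each matching record is simply printed.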
A very common kind of expression used as a pattern is the comparison expression, using the comparison operators described in section Variable Typing and Comparison Expressions.
Regexp matching and non-matching are also very common expressions.
The left operand of the `~' and `!~' operators is a string. The right operand is either a constant regular expression enclosed in slashes (/regexp/), or any expression, whose string value is used as a dynamic regular expression (see section Using Dynamic Regexps).
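The following sketch shows the second form, where the right operand of `~' is an ordinary expression. The variable name `pat`, the function name `match_first`, and the input lines are assumptions chosen for the example:

```shell
# Hypothetical helper: match the first field against a regexp held in a variable.
match_first() {
    printf 'foot 1\nbar 2\nmacfoo 3\n' |
    awk -v pat='^foo' '$1 ~ pat { print $2 }'   # pat is used as a dynamic regexp
}
match_first
# prints:
# 1
```

Only `foot' begins with `foo', so only its second field is printed.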
The following example prints the second field of each input record whose first field is precisely `foo'.
$ awk '$1 == "foo" { print $2 }' BBS-list
(There is no output, since there is no BBS site named "foo".) Contrast this with the following regular expression match, which would accept any record with a first field that contains `foo':
$ awk '$1 ~ /foo/ { print $2 }' BBS-list
-| 555-1234
-| 555-6699
-| 555-6480
-| 555-2127
Boolean expressions are also commonly used as patterns. Whether the pattern matches an input record depends on whether its subexpressions match.
For example, the following command prints all records in `BBS-list' that contain both `2400' and `foo'.
$ awk '/2400/ && /foo/' BBS-list
-| fooey 555-1234 2400/1200/300 B
The following command prints all records in `BBS-list' that contain either `2400' or `foo', or both.
$ awk '/2400/ || /foo/' BBS-list
-| alpo-net 555-3412 2400/1200/300 A
-| bites 555-1675 2400/1200/300 A
-| fooey 555-1234 2400/1200/300 B
-| foot 555-6699 1200/300 B
-| macfoo 555-6480 1200/300 A
-| sdace 555-3430 2400/1200/300 A
-| sabafoo 555-2127 1200/300 C
The following command prints all records in `BBS-list' that do not contain the string `foo'.
$ awk '! /foo/' BBS-list
-| aardvark 555-5553 1200/300 B
-| alpo-net 555-3412 2400/1200/300 A
-| barfly 555-7685 1200/300 A
-| bites 555-1675 2400/1200/300 A
-| camelot 555-0542 300 C
-| core 555-2912 1200/300 C
-| sdace 555-3430 2400/1200/300 A
The subexpressions of a boolean operator in a pattern can be constant regular expressions, comparisons, or any other awk expressions. Range patterns are not expressions, so they cannot appear inside boolean patterns. Likewise, the special patterns BEGIN and END, which never match any input record, are not expressions and cannot appear inside boolean patterns.
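As an illustration of mixing kinds of subexpressions, the sketch below combines a regexp match with a numeric comparison. The two-column input and the function name `bool_pattern` are hypothetical:

```shell
# Hypothetical helper: first field contains "foo" AND second field is at least 2400.
bool_pattern() {
    printf 'foot 1200\nfooey 2400\nbites 2400\n' |
    awk '$1 ~ /foo/ && $2 >= 2400 { print $1 }'
}
bool_pattern
# prints:
# fooey
```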
A regexp constant as a pattern is also a special case of an expression pattern. /foo/ as an expression has the value one if `foo' appears in the current input record; thus, as a pattern, /foo/ matches any record containing `foo'.
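To see the value of a regexp constant directly, the sketch below assigns it to a variable and prints the result for each record. The variable name `hit`, the function name `regexp_value`, and the input are assumptions made for the example:

```shell
# Hypothetical helper: show the numeric value of /foo/ for each input record.
regexp_value() {
    printf 'foot\nbar\n' |
    awk '{ hit = /foo/; print hit }'   # 1 when the record contains foo, else 0
}
regexp_value
# prints:
# 1
# 0
```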
A range pattern is made of two patterns separated by a comma, of the form `begpat, endpat'. It matches ranges of consecutive input records. The first pattern, begpat, controls where the range begins, and the second one, endpat, controls where it ends. For example,
awk '$1 == "on", $1 == "off"'
prints every record between `on'/`off' pairs, inclusive.
A range pattern starts out by matching begpat against every input record; when a record matches begpat, the range pattern becomes turned on. The range pattern matches this record. As long as it stays turned on, it automatically matches every input record read. It also matches endpat against every input record; when that succeeds, the range pattern is turned off again for the following record. Then it goes back to checking begpat against each record.
The record that turns on the range pattern and the one that turns it off both match the range pattern. If you don't want to operate on these records, you can write if statements in the rule's action to distinguish them from the records you are interested in.
It is possible for a pattern to be turned both on and off by the same record, if the record satisfies both conditions. Then the action is executed for just that record.
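The two behaviors just described can be sketched with made-up input. The function names `range_demo` and `same_record` are hypothetical. The first function shows that the records turning the range on and off are both included; the second shows a range that is turned on and off by the same record:

```shell
# Hypothetical helper: the "on" and "off" records are part of the range.
range_demo() {
    printf 'a\non\nb\noff\nc\n' |
    awk '$1 == "on", $1 == "off"'   # no action, so matching records are printed
}

# Hypothetical helper: each % line both starts and ends a one-record range.
same_record() {
    printf '%%\ntext\n%%\n' |
    awk '/^%$/, /^%$/'
}

range_demo
# prints:
# on
# b
# off

same_record
# prints:
# %
# %
```

In the second case the line `text' is never printed, because the range has already been turned off by the time it is read.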
For example, suppose you have text between two identical markers (say the `%' symbol) that you wish to ignore. You might try to combine a range pattern that describes the delimited text with the next statement (not discussed yet, see section The next Statement), which causes awk to skip any further processing of the current record and start over again with the next input record. Such a program would look like this:
/^%$/,/^%$/    { next }
               { print }
This program fails because the range pattern is both turned on and turned off by the first line with just a `%' on it. To accomplish this task, you must write the program this way, using a flag:
/^%$/     { skip = ! skip; next }
skip == 1 { next } # skip lines with `skip' set
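For a complete, runnable sketch of the flag technique, the version below adds a final `{ print }' rule, as in the first attempt, and feeds the program some made-up input. The added rule, the input, and the function name `strip_marked` are assumptions for illustration:

```shell
# Hypothetical helper: drop everything between pairs of % marker lines.
strip_marked() {
    printf 'keep 1\n%%\ndrop\n%%\nkeep 2\n' |
    awk '
    /^%$/     { skip = ! skip; next }   # toggle the flag on each marker line
    skip == 1 { next }                  # inside a marked block: discard
              { print }                 # everything else is printed
    '
}
strip_marked
# prints:
# keep 1
# keep 2
```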
Note that in a range pattern, the `,' has the lowest precedence (is evaluated last) of all the operators. Thus, for example, the following program attempts to combine a range pattern with another, simpler test.
echo Yes | awk '/1/,/2/ || /Yes/'
The author of this program intended it to mean `(/1/,/2/) || /Yes/'. However, awk interprets this as `/1/, (/2/ || /Yes/)'. This cannot be changed or worked around; range patterns do not combine with other patterns.
BEGIN and END Special Patterns
BEGIN and END are special patterns. They are not used to match input records. Rather, they supply start-up or clean-up actions for your awk script.
A BEGIN rule is executed, once, before the first input record has been read. An END rule is executed, once, after all the input has been read. For example:
$ awk '
> BEGIN { print "Analysis of \"foo\"" }
> /foo/ { ++n }
> END   { print "\"foo\" appears " n " times." }' BBS-list
-| Analysis of "foo"
-| "foo" appears 4 times.
This program finds the number of records in the input file `BBS-list' that contain the string `foo'. The BEGIN rule prints a title for the report. There is no need to use the BEGIN rule to initialize the counter n to zero, as awk does this automatically (see section Variables).
The second rule increments the variable n every time a record containing the pattern `foo' is read. The END rule prints the value of n at the end of the run.
The special patterns BEGIN and END cannot be used in ranges or with boolean operators (indeed, they cannot be used with any operators).
An awk program may have multiple BEGIN and/or END rules. They are executed in the order they appear, all the BEGIN rules at start-up and all the END rules at termination. BEGIN and END rules may be intermixed with other rules. This feature was added in the 1987 version of awk, and is included in the POSIX standard. The original (1978) version of awk required you to put the BEGIN rule at the beginning of the program, and the END rule at the end, and only allowed one of each. This is no longer required, but it is a good idea in terms of program organization and readability.
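The ordering rule can be sketched with a deliberately scrambled program. The messages and the function name `begin_end_order` are made up for illustration, and this layout is exactly what the readability advice above argues against:

```shell
# Hypothetical helper: BEGIN and END rules intermixed with an ordinary rule.
begin_end_order() {
    printf 'x\n' |
    awk '
    END   { print "end 1" }
    BEGIN { print "begin 1" }
          { print "record: " $0 }
    BEGIN { print "begin 2" }
    END   { print "end 2" }
    '
}
begin_end_order
# prints:
# begin 1
# begin 2
# record: x
# end 1
# end 2
```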
Multiple BEGIN and END rules are useful for writing library functions, since each library file can have its own BEGIN and/or END rule to do its own initialization and/or cleanup. Note that the order in which library functions are named on the command line controls the order in which their BEGIN and END rules are executed. Therefore you have to be careful to write such rules in library files so that the order in which they are executed doesn't matter. See section Command Line Options, for more information on using library functions. See section A Library of awk Functions, for a number of useful library functions.
If an awk program only has a BEGIN rule, and no other rules, then the program exits after the BEGIN rule has been run. (The original version of awk used to keep reading and ignoring input until end of file was seen.) However, if an END rule exists, then the input will be read, even if there are no other rules in the program. This is necessary in case the END rule checks the FNR and NR variables (d.c.).
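A brief sketch of both cases, using made-up input and the hypothetical function names `begin_only` and `end_only`. The first program never needs its input. The second has no rule that looks at individual records, yet the input is still read so that NR is correct in the END rule:

```shell
# Hypothetical helper: a program with only a BEGIN rule.
begin_only() {
    printf 'a\nb\n' | awk 'BEGIN { print "start" }'
}

# Hypothetical helper: a program with only an END rule still reads its input.
end_only() {
    printf 'a\nb\nc\n' | awk 'END { print NR }'
}

begin_only
# prints:
# start

end_only
# prints:
# 3
```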
BEGIN and END rules must have actions; there is no default action for these rules since there is no current record when they run.
BEGIN and END Rules
There are several (sometimes subtle) issues involved when doing I/O from a BEGIN or END rule.
The first has to do with the value of $0 in a BEGIN rule. Since BEGIN rules are executed before any input is read, there simply is no input record, and therefore no fields, when executing BEGIN rules. References to $0 and the fields yield a null string or zero, depending upon the context. One way to give $0 a real value is to execute a getline command without a variable (see section Explicit Input with getline). Another way is to simply assign a value to it.
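The assignment approach can be sketched as follows. The record text and the function name `begin_record` are made up for the example:

```shell
# Hypothetical helper: give $0 a value inside a BEGIN rule by assignment.
begin_record() {
    awk 'BEGIN {
        print "NF before:", NF      # no record yet, so NF is zero
        $0 = "alpha beta gamma"     # assigning to $0 splits it into fields
        print "NF after:", NF, "second field:", $2
    }'
}
begin_record
# prints:
# NF before: 0
# NF after: 3 second field: beta
```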
The second point is similar to the first, but from the other direction. Inside an END rule, what is the value of $0 and NF? Traditionally, due largely to implementation issues, $0 and NF were undefined inside an END rule. The POSIX standard specified that NF was available in an END rule, containing the number of fields from the last input record. Due most probably to an oversight, the standard does not say that $0 is also preserved, although logically one would think that it should be. In fact, gawk does preserve the value of $0 for use in END rules. Be aware, however, that Unix awk, and possibly other implementations, do not.
The third point follows from the first two. What is the meaning of `print' inside a BEGIN or END rule? The meaning is the same as always, `print $0'. If $0 is the null string, then this prints an empty line. Many long-time awk programmers use `print' in BEGIN and END rules, to mean `print ""', relying on $0 being null. While you might generally get away with this in BEGIN rules, in gawk at least, it is a very bad idea in END rules. It is also poor style, since if you want an empty line in the output, you should say so explicitly in your program.
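A short sketch of the explicit style, with a made-up report layout and the hypothetical function name `report`. Every empty line is requested with `print ""', so the output does not depend on what $0 happens to hold in the BEGIN or END rule:

```shell
# Hypothetical helper: a small report that asks for its blank lines explicitly.
report() {
    printf 'a\nb\n' |
    awk '
    BEGIN { print "Report"; print "" }       # explicit empty line
          { print NR ": " $0 }
    END   { print ""; print NR, "records" }  # explicit empty line again
    '
}
report
```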
An empty (i.e. non-existent) pattern is considered to match every input record. For example, the program:
awk '{ print $1 }' BBS-list
prints the first field of every record.
An awk program or script consists of a series of rules and function definitions, interspersed. (Functions are described later. See section User-defined Functions.)
A rule contains a pattern and an action, either of which (but not both) may be omitted. The purpose of the action is to tell awk what to do once a match for the pattern is found. Thus, in outline, an awk program generally looks like this:
[pattern] [{ action }]
[pattern] [{ action }]
...
function name(args) { ... }
...
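As a concrete instance of this outline, the sketch below combines a function definition with several rules. The function `square`, the variable `big`, the shell function name `layout_demo`, and the input are all hypothetical:

```shell
# Hypothetical helper: rules and a function definition in one program.
layout_demo() {
    printf '3\n4\n' |
    awk '
    function square(x) { return x * x }   # function definition
    BEGIN  { print "squares:" }           # rule with a special pattern
    $1 > 3 { big++ }                      # rule with a pattern and an action
           { print square($1) }           # rule with an empty pattern
    END    { print big + 0, "above 3" }
    '
}
layout_demo
# prints:
# squares:
# 9
# 16
# 1 above 3
```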
An action consists of one or more awk statements, enclosed in curly braces (`{' and `}'). Each statement specifies one thing to be done. The statements are separated by newlines or semicolons.
The curly braces around an action must be used even if the action contains only one statement, or even if it contains no statements at all. However, if you omit the action entirely, omit the curly braces as well. An omitted action is equivalent to `{ print $0 }'.
/foo/  { }  # match foo, do nothing - empty action
/foo/       # match foo, print the record - omitted action
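The difference between the two rules can be checked with a pair of one-line programs. The input and the function names `empty_action` and `omitted_action` are made up for illustration:

```shell
# Hypothetical helpers: the same pattern with an empty and an omitted action.
empty_action()   { printf 'foo\nbar\n' | awk '/foo/ { }'; }
omitted_action() { printf 'foo\nbar\n' | awk '/foo/'; }

empty_action      # prints nothing
omitted_action
# prints:
# foo
```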
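Here are the kinds of statements supported in awk:
Expressions
    Expressions can call functions or assign values to variables. Executing this kind of statement simply computes the value of the expression.
Control statements
    Control statements specify the control flow of awk programs. The awk language gives you C-like constructs (if, for, while, and do) as well as a few special ones (see section Control Statements in Actions).
Compound statements
    Compound statements consist of one or more statements enclosed in curly braces. A compound statement is used to put several statements together in the body of an if, while, do or for statement.
Input statements
    Input statements use the getline command (see section Explicit Input with getline), the next statement (see section The next Statement), and the nextfile statement (see section The nextfile Statement).
Output statements
    Output statements are print and printf. See section Printing Output.
Deletion statements
    Deletion statements delete array elements. See section The delete Statement.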
The next chapter covers control statements in detail.