Patterns and Actions

As you have already seen, each awk statement consists of a pattern with an associated action. This chapter describes how you build patterns and actions.

Pattern Elements

Patterns in awk control the execution of rules: a rule is executed when its pattern matches the current input record. This section explains all about how to write patterns.

Kinds of Patterns

Here is a summary of the types of patterns supported in awk.

/regular expression/: A regular expression as a pattern. It matches when the text of the input record fits the regular expression. (See section Regular Expressions.)
expression: A single expression. It matches when its value is non-zero (if a number) or non-null (if a string). (See section Expressions as Patterns.)
pat1, pat2: A pair of patterns separated by a comma, specifying a range of records. The range includes both the initial record that matches pat1, and the final record that matches pat2. (See section Specifying Record Ranges with Patterns.)
BEGIN
END: Special patterns for you to supply start-up or clean-up actions for your awk program. (See section The BEGIN and END Special Patterns.)
empty: The empty pattern matches every input record. (See section The Empty Pattern.)

Regular Expressions as Patterns

We have been using regular expressions as patterns since our early examples. This kind of pattern is simply a regexp constant in the pattern part of a rule. Its meaning is `$0 ~ /pattern/'. The pattern matches when the input record matches the regexp. For example:

/foo|bar|baz/  { buzzwords++ }
END            { print buzzwords, "buzzwords seen" }

Expressions as Patterns

Any awk expression is valid as an awk pattern. Then the pattern matches if the expression's value is non-zero (if a number) or non-null (if a string).

The expression is reevaluated each time the rule is tested against a new input record. If the expression uses fields such as $1, the value depends directly on the new input record's text; otherwise, it depends only on what has happened so far in the execution of the awk program, but that may still be useful.

A very common kind of expression used as a pattern is the comparison expression, using the comparison operators described in section Variable Typing and Comparison Expressions.

Regexp matching and non-matching are also very common expressions. The left operand of the `~' and `!~' operators is a string. The right operand is either a constant regular expression enclosed in slashes (/regexp/), or any expression, whose string value is used as a dynamic regular expression (see section Using Dynamic Regexps).

The following example prints the second field of each input record whose first field is precisely `foo'.

$ awk '$1 == "foo" { print $2 }' BBS-list

(There is no output, since there is no BBS site named "foo".) Contrast this with the following regular expression match, which would accept any record with a first field that contains `foo':

$ awk '$1 ~ /foo/ { print $2 }' BBS-list
-| 555-1234
-| 555-6699
-| 555-6480
-| 555-2127

Boolean expressions are also commonly used as patterns. Whether the pattern matches an input record depends on whether its subexpressions match.

For example, the following command prints all records in `BBS-list' that contain both `2400' and `foo'.

$ awk '/2400/ && /foo/' BBS-list
-| fooey        555-1234     2400/1200/300     B

The following command prints all records in `BBS-list' that contain either `2400' or `foo', or both.

$ awk '/2400/ || /foo/' BBS-list
-| alpo-net     555-3412     2400/1200/300     A
-| bites        555-1675     2400/1200/300     A
-| fooey        555-1234     2400/1200/300     B
-| foot         555-6699     1200/300          B
-| macfoo       555-6480     1200/300          A
-| sdace        555-3430     2400/1200/300     A
-| sabafoo      555-2127     1200/300          C

The following command prints all records in `BBS-list' that do not contain the string `foo'.

$ awk '! /foo/' BBS-list
-| aardvark     555-5553     1200/300          B
-| alpo-net     555-3412     2400/1200/300     A
-| barfly       555-7685     1200/300          A
-| bites        555-1675     2400/1200/300     A
-| camelot      555-0542     300               C
-| core         555-2912     1200/300          C
-| sdace        555-3430     2400/1200/300     A

The subexpressions of a boolean operator in a pattern can be constant regular expressions, comparisons, or any other awk expressions. Range patterns are not expressions, so they cannot appear inside boolean patterns. Likewise, the special patterns BEGIN and END, which never match any input record, are not expressions and cannot appear inside boolean patterns.

A regexp constant as a pattern is also a special case of an expression pattern. /foo/ as an expression has the value one if `foo' appears in the current input record; thus, as a pattern, /foo/ matches any record containing `foo'.

Specifying Record Ranges with Patterns

A range pattern is made of two patterns separated by a comma, of the form `begpat, endpat'. It matches ranges of consecutive input records. The first pattern, begpat, controls where the range begins, and the second one, endpat, controls where it ends. For example,

awk '$1 == "on", $1 == "off"'

prints every record between `on'/`off' pairs, inclusive.

A range pattern starts out by matching begpat against every input record; when a record matches begpat, the range pattern becomes turned on. The range pattern matches this record. As long as it stays turned on, it automatically matches every input record read. It also matches endpat against every input record; when that succeeds, the range pattern is turned off again for the following record. Then it goes back to checking begpat against each record.

The record that turns on the range pattern and the one that turns it off both match the range pattern. If you don't want to operate on these records, you can write if statements in the rule's action to distinguish them from the records you are interested in.

It is possible for a pattern to be turned both on and off by the same record, if the record satisfies both conditions. Then the action is executed for just that record.

For example, suppose you have text between two identical markers (say the `%' symbol) that you wish to ignore. You might try to combine a range pattern that describes the delimited text with the next statement (not discussed yet, see section The next Statement), which causes awk to skip any further processing of the current record and start over again with the next input record. Such a program would like this:

/^%$/,/^%$/    { next }
               { print }

This program fails because the range pattern is both turned on and turned off by the first line with just a `%' on it. To accomplish this task, you must write the program this way, using a flag:

/^%$/     { skip = ! skip; next }
skip == 1 { next } # skip lines with `skip' set

Note that in a range pattern, the `,' has the lowest precedence (is evaluated last) of all the operators. Thus, for example, the following program attempts to combine a range pattern with another, simpler test.

echo Yes | awk '/1/,/2/ || /Yes/'

The author of this program intended it to mean `(/1/,/2/) || /Yes/'. However, awk interprets this as `/1/, (/2/ || /Yes/)'. This cannot be changed or worked around; range patterns do not combine with other patterns.

The `BEGIN` and `END` Special Patterns

BEGIN and END are special patterns. They are not used to match input records. Rather, they supply start-up or clean-up actions for your awk script.

Startup and Cleanup Actions

A BEGIN rule is executed, once, before the first input record has been read. An END rule is executed, once, after all the input has been read. For example:

$ awk '
> BEGIN { print "Analysis of \"foo\"" }
> /foo/ { ++n }
> END   { print "\"foo\" appears " n " times." }' BBS-list
-| Analysis of "foo"
-| "foo" appears 4 times.

This program finds the number of records in the input file `BBS-list' that contain the string `foo'. The BEGIN rule prints a title for the report. There is no need to use the BEGIN rule to initialize the counter n to zero, as awk does this automatically (see section Variables).

The second rule increments the variable n every time a record containing the pattern `foo' is read. The END rule prints the value of n at the end of the run.

The special patterns BEGIN and END cannot be used in ranges or with boolean operators (indeed, they cannot be used with any operators).

An awk program may have multiple BEGIN and/or END rules. They are executed in the order they appear, all the BEGIN rules at start-up and all the END rules at termination. BEGIN and END rules may be intermixed with other rules. This feature was added in the 1987 version of awk, and is included in the POSIX standard. The original (1978) version of awk required you to put the BEGIN rule at the beginning of the program, and the END rule at the end, and only allowed one of each. This is no longer required, but it is a good idea in terms of program organization and readability.

Multiple BEGIN and END rules are useful for writing library functions, since each library file can have its own BEGIN and/or END rule to do its own initialization and/or cleanup. Note that the order in which library functions are named on the command line controls the order in which their BEGIN and END rules are executed. Therefore you have to be careful to write such rules in library files so that the order in which they are executed doesn't matter. See section Command Line Options, for more information on using library functions. See section A Library of awk Functions, for a number of useful library functions.

If an awk program only has a BEGIN rule, and no other rules, then the program exits after the BEGIN rule has been run. (The original version of awk used to keep reading and ignoring input until end of file was seen.) However, if an END rule exists, then the input will be read, even if there are no other rules in the program. This is necessary in case the END rule checks the FNR and NR variables (d.c.).

BEGIN and END rules must have actions; there is no default action for these rules since there is no current record when they run.

Input/Output from `BEGIN` and `END` Rules

There are several (sometimes subtle) issues involved when doing I/O from a BEGIN or END rule.

The first has to do with the value of $0 in a BEGIN rule. Since BEGIN rules are executed before any input is read, there simply is no input record, and therefore no fields, when executing BEGIN rules. References to $0 and the fields yield a null string or zero, depending upon the context. One way to give $0 a real value is to execute a getline command without a variable (see section Explicit Input with getline). Another way is to simply assign a value to it.

The second point is similar to the first, but from the other direction. Inside an END rule, what is the value of $0 and NF? Traditionally, due largely to implementation issues, $0 and NF were undefined inside an END rule. The POSIX standard specified that NF was available in an END rule, containing the number of fields from the last input record. Due most probably to an oversight, the standard does not say that $0 is also preserved, although logically one would think that it should be. In fact, gawk does preserve the value of $0 for use in END rules. Be aware, however, that Unix awk, and possibly other implementations, do not.

The third point follows from the first two. What is the meaning of `print' inside a BEGIN or END rule? The meaning is the same as always, `print $0'. If $0 is the null string, then this prints an empty line. Many long time awk programmers use `print' in BEGIN and END rules, to mean `print ""', relying on $0 being null. While you might generally get away with this in BEGIN rules, in gawk at least, it is a very bad idea in END rules. It is also poor style, since if you want an empty line in the output, you should say so explicitly in your program.

The Empty Pattern

An empty (i.e. non-existent) pattern is considered to match every input record. For example, the program:

awk '{ print $1 }' BBS-list

prints the first field of every record.

Overview of Actions

An awk program or script consists of a series of rules and function definitions, interspersed. (Functions are described later. See section User-defined Functions.)

A rule contains a pattern and an action, either of which (but not both) may be omitted. The purpose of the action is to tell awk what to do once a match for the pattern is found. Thus, in outline, an awk program generally looks like this:

[pattern] [{ action }]
[pattern] [{ action }]
...
function name(args) { ... }
...

An action consists of one or more awk statements, enclosed in curly braces (`{' and `}'). Each statement specifies one thing to be done. The statements are separated by newlines or semicolons.

The curly braces around an action must be used even if the action contains only one statement, or even if it contains no statements at all. However, if you omit the action entirely, omit the curly braces as well. An omitted action is equivalent to `{ print $0 }'.

/foo/  { }  # match foo, do nothing - empty action
/foo/       # match foo, print the record - omitted action

Here are the kinds of statements supported in awk:

Expressions, which can call functions or assign values to variables (see section Expressions). Executing this kind of statement simply computes the value of the expression. This is useful when the expression has side effects (see section Assignment Expressions).
Control statements, which specify the control flow of awk programs. The awk language gives you C-like constructs (if, for, while, and do) as well as a few special ones (see section Control Statements in Actions).
Compound statements, which consist of one or more statements enclosed in curly braces. A compound statement is used in order to put several statements together in the body of an if, while, do or for statement.
Input statements, using the getline command (see section Explicit Input with getline), the next statement (see section The next Statement), and the nextfile statement (see section The nextfile Statement).
Output statements, print and printf. See section Printing Output.
Deletion statements, for deleting array elements. See section The delete Statement.

The next chapter covers control statements in detail.

Mini annuaire : Gawk

Youhp3	Youpee est un preprocesseur HTML pour vous simplifier toutes les tâches répétitives dans la création d'un site web. Salemioche.net utilise trés largement ses possibilités
cygwin	le compilateur gcc sous windows ainsi que tous les outils unix (awk, grep, sed, bash, ksh ...)