Reading Input Files

In the typical awk program, all input is read either from the standard input (by default the keyboard, but often a pipe from another command) or from files whose names you specify on the awk command line. If you specify input files, awk reads them in order, reading all the data from one before going on to the next. The name of the current input file can be found in the built-in variable FILENAME (see section Built-in Variables).

The input is read in units called records, and processed by the rules of your program one record at a time. By default, each record is one line. Each record is automatically split into chunks called fields. This makes it more convenient for programs to work on the parts of a record.

On rare occasions you will need to use the getline command. The getline command is valuable, both because it can do explicit input from any number of files, and because the files used with it do not have to be named on the awk command line (see section Explicit Input with getline).

How Input is Split into Records

The awk utility divides the input for your awk program into records and fields. Records are separated by a character called the record separator. By default, the record separator is the newline character. This is why records are, by default, single lines. You can use a different character for the record separator by assigning the character to the built-in variable RS.

You can change the value of RS in the awk program, like any other variable, with the assignment operator, `=' (see section Assignment Expressions). The new record-separator character should be enclosed in quotation marks, which indicate a string constant. Often the right time to do this is at the beginning of execution, before any input has been processed, so that the very first record will be read with the proper separator. To do this, use the special BEGIN pattern (see section The BEGIN and END Special Patterns). For example:

awk 'BEGIN { RS = "/" } ; { print $0 }' BBS-list

changes the value of RS to "/", before reading any input. This is a string whose first character is a slash; as a result, records are separated by slashes. Then the input file is read, and the second rule in the awk program (the action with no pattern) prints each record. Since each print statement adds a newline at the end of its output, the effect of this awk program is to copy the input with each slash changed to a newline. Here are the results of running the program on `BBS-list':

$ awk 'BEGIN { RS = "/" } ; { print $0 }' BBS-list
-| aardvark     555-5553     1200
-| 300          B
-| alpo-net     555-3412     2400
-| 1200
-| 300     A
-| barfly       555-7685     1200
-| 300          A
-| bites        555-1675     2400
-| 1200
-| 300     A
-| camelot      555-0542     300               C
-| core         555-2912     1200
-| 300          C
-| fooey        555-1234     2400
-| 1200
-| 300     B
-| foot         555-6699     1200
-| 300          B
-| macfoo       555-6480     1200
-| 300          A
-| sdace        555-3430     2400
-| 1200
-| 300     A
-| sabafoo      555-2127     1200
-| 300          C
-|

Note that the entry for the `camelot' BBS is not split. In the original data file (see section Data Files for the Examples), the line looks like this:

camelot      555-0542     300               C

It only has one baud rate; there are no slashes in the record.

Another way to change the record separator is on the command line, using the variable-assignment feature (see section Other Command Line Arguments).

awk '{ print $0 }' RS="/" BBS-list

This sets RS to `/' before processing `BBS-list'.
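
The two ways of setting RS shown above should be interchangeable for simple input. The following is a minimal sketch, assuming a made-up one-line input rather than the `BBS-list' file; the shell variable names `data', `in_begin' and `on_cmdline' are chosen here purely for illustration. The `-' operand asks awk to read standard input after the assignment has been processed.

```shell
# Hypothetical input: three baud rates separated by slashes, no trailing newline.
data='2400/1200/300'

# Set RS in a BEGIN rule, before any input is read.
in_begin=$(printf '%s' "$data" | awk 'BEGIN { RS = "/" } { print $0 }')

# Set RS with a command-line assignment; "-" names standard input.
on_cmdline=$(printf '%s' "$data" | awk '{ print $0 }' RS="/" -)

printf '%s\n' "$in_begin"
[ "$in_begin" = "$on_cmdline" ] && echo "same records either way"
```

Each of the three values is printed on its own line, followed by the confirmation message.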

Using an unusual character such as `/' for the record separator produces correct behavior in the vast majority of cases. However, the following (extreme) pipeline prints a surprising `1'. The value of the built-in variable NF is the number of fields in the current record; here there is one field, consisting of a newline.

$ echo | awk 'BEGIN { RS = "a" } ; { print NF }'
-| 1

Reaching the end of an input file terminates the current input record, even if the last character in the file is not the character in RS (d.c.).

The empty string, "" (a string of no characters), has a special meaning as the value of RS: it means that records are separated by one or more blank lines, and nothing else. See section Multiple-Line Records, for more details.

If you change the value of RS in the middle of an awk run, the new value is used to delimit subsequent records, but the record currently being processed (and records already processed) are not affected.

After the end of the record has been determined, gawk sets the variable RT to the text in the input that matched RS.

The value of RS is in fact not limited to a one-character string. It can be any regular expression (see section Regular Expressions). In general, each record ends at the next string that matches the regular expression; the next record starts at the end of the matching string. This general rule is actually at work in the usual case, where RS contains just a newline: a record ends at the beginning of the next matching string (the next newline in the input) and the following record starts just after the end of this string (at the first character of the following line). The newline, since it matches RS, is not part of either record.

When RS is a single character, RT will contain the same single character. However, when RS is a regular expression, then RT becomes more useful; it contains the actual input text that matched the regular expression.

The following example illustrates both of these features. It sets RS equal to a regular expression that matches either a newline, or a series of one or more upper-case letters with optional leading and/or trailing white space (see section Regular Expressions).

$ echo record 1 AAAA record 2 BBBB record 3 |
> gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
>             { print "Record =", $0, "and RT =", RT }'
-| Record = record 1 and RT =  AAAA 
-| Record = record 2 and RT =  BBBB 
-| Record = record 3 and RT = 
-|

The final line of output has an extra blank line. This is because the value of RT is a newline, and then the print statement supplies its own terminating newline.

See section A Simple Stream Editor, for a more useful example of RS as a regexp and RT.

The use of RS as a regular expression and the RT variable are gawk extensions; they are not available in compatibility mode (see section Command Line Options). In compatibility mode, only the first character of the value of RS is used to determine the end of the record.

The awk utility keeps track of the number of records that have been read so far from the current input file. This value is stored in a built-in variable called FNR. It is reset to zero when a new file is started. Another built-in variable, NR, is the total number of input records read so far from all data files. It starts at zero but is never automatically reset to zero.
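
The difference between FNR and NR only shows up with more than one input file. The sketch below assumes two throwaway files, `first.txt' and `second.txt', created in a temporary directory; the file names and contents are invented for this illustration.

```shell
# Create two small throwaway files (hypothetical names) in a temp directory.
dir=$(mktemp -d)
printf 'one\ntwo\n' > "$dir/first.txt"
printf 'three\n'    > "$dir/second.txt"

# FNR restarts for each file; NR keeps counting across files.
counts=$(cd "$dir" && awk '{ print FILENAME, FNR, NR }' first.txt second.txt)
printf '%s\n' "$counts"

rm -r "$dir"
```

The third line of output is `second.txt 1 3': FNR has started over for the second file, while NR has continued counting.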

Examining Fields

When awk reads an input record, the record is automatically separated or parsed by the interpreter into chunks called fields. By default, fields are separated by whitespace, like words in a line. Whitespace in awk means any string of one or more spaces and/or tabs; other characters such as newline, formfeed, and so on, that are considered whitespace by other languages are not considered whitespace by awk.

The purpose of fields is to make it more convenient for you to refer to these pieces of the record. You don't have to use them--you can operate on the whole record if you wish--but fields are what make simple awk programs so powerful.

To refer to a field in an awk program, you use a dollar-sign, `$', followed by the number of the field you want. Thus, $1 refers to the first field, $2 to the second, and so on. For example, suppose the following is a line of input:

This seems like a pretty nice example.

Here the first field, or $1, is `This'; the second field, or $2, is `seems'; and so on. Note that the last field, $7, is `example.'. Because there is no space between the `e' and the `.', the period is considered part of the seventh field.

NF is a built-in variable whose value is the number of fields in the current record. awk updates the value of NF automatically, each time a record is read.

No matter how many fields there are, the last field in a record can be represented by $NF. So, in the example above, $NF would be the same as $7, which is `example.'. Why this works is explained below (see section Non-constant Field Numbers). If you try to reference a field beyond the last one, such as $8 when the record has only seven fields, you get the empty string.

$0, which looks like a reference to the "zeroth" field, is a special case: it represents the whole input record. $0 is used when you are not interested in fields.
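
The field references described so far can be checked against the sample sentence used above. This is only a sketch; the shell variables `line' and `report' are names picked for this example.

```shell
line='This seems like a pretty nice example.'

report=$(printf '%s\n' "$line" | awk '{
    print "NF =", NF            # number of fields in the record
    print "$1 =", $1            # first field
    print "$NF =", $NF          # last field, same as $7 here
    print "$8 = [" $8 "]"       # beyond the last field: empty string
}')
printf '%s\n' "$report"
```

The brackets around $8 make the empty string visible in the output.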

Here are some more examples:

$ awk '$1 ~ /foo/ { print $0 }' BBS-list
-| fooey        555-1234     2400/1200/300     B
-| foot         555-6699     1200/300          B
-| macfoo       555-6480     1200/300          A
-| sabafoo      555-2127     1200/300          C

This example prints each record in the file `BBS-list' whose first field contains the string `foo'. The operator `~' is called a matching operator (see section How to Use Regular Expressions); it tests whether a string (here, the field $1) matches a given regular expression.

By contrast, the following example looks for `foo' in the entire record and prints the first field and the last field for each input record containing a match.

$ awk '/foo/ { print $1, $NF }' BBS-list
-| fooey B
-| foot B
-| macfoo A
-| sabafoo C
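
The difference between matching against one field and matching against the whole record is easier to see on input where the two disagree. The following sketch uses two invented input lines (not taken from `BBS-list'); `foo' is the first field only on the first line.

```shell
# Hypothetical two-line input: "foo" is the first field on line 1 only.
data='foo 1
bar foo'

# Match against the first field only.
first_field=$(printf '%s\n' "$data" | awk '$1 ~ /foo/ { print $0 }')

# Match against the entire record.
whole_record=$(printf '%s\n' "$data" | awk '/foo/ { print $1, $NF }')

printf 'field match:\n%s\n' "$first_field"
printf 'record match:\n%s\n' "$whole_record"
```

The field match selects only the first line, while the record match selects both.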

Non-constant Field Numbers

The number of a field does not need to be a constant. Any expression in the awk language can be used after a `$' to refer to a field. The value of the expression specifies the field number. If the value is a string, rather than a number, it is converted to a number. Consider this example:

awk '{ print $NR }'

Recall that NR is the number of records read so far: one in the first record, two in the second, etc. So this example prints the first field of the first record, the second field of the second record, and so on. For the twentieth record, field number 20 is printed; most likely, the record has fewer than 20 fields, so this prints a blank line.

Here is another example of using expressions as field numbers:

awk '{ print $(2*2) }' BBS-list

awk must evaluate the expression `(2*2)' and use its value as the number of the field to print. The `*' sign represents multiplication, so the expression `2*2' evaluates to four. The parentheses are used so that the multiplication is done before the `$' operation; they are necessary whenever there is a binary operator in the field-number expression. This example, then, prints the hours of operation (the fourth field) for every line of the file `BBS-list'. (All of the awk operators are listed, in order of decreasing precedence, in section Operator Precedence (How Operators Nest).)

If the field number you compute is zero, you get the entire record. Thus, $(2-2) has the same value as $0. Negative field numbers are not allowed; trying to reference one will usually terminate your running awk program. (The POSIX standard does not define what happens when you reference a negative field number. gawk will notice this and terminate your program. Other awk implementations may behave differently.)

As mentioned in section Examining Fields, the number of fields in the current record is stored in the built-in variable NF (also see section Built-in Variables). The expression $NF is not a special feature: it is the direct consequence of evaluating NF and using its value as a field number.
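
The need for parentheses around a field-number expression can be illustrated with NF. In this sketch (the input numbers are made up), `$(NF-1)' is the next-to-last field, while `$NF-1' is the last field minus one, because `$' binds more tightly than subtraction.

```shell
# $(NF-1) selects field 2; $NF-1 subtracts one from field 3.
both=$(echo '10 20 30' | awk '{ print $(NF-1), $NF-1 }')
echo "$both"
```

This prints `20 29'.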

Changing the Contents of a Field

You can change the contents of a field as seen by awk within an awk program; this changes what awk perceives as the current input record. (The actual input is untouched; awk never modifies the input file.)

Consider this example and its output:

$ awk '{ $3 = $2 - 10; print $2, $3 }' inventory-shipped
-| 13 3
-| 15 5
-| 15 5
...

The `-' sign represents subtraction, so this program reassigns field three, $3, to be the value of field two minus ten, `$2 - 10'. (See section Arithmetic Operators.) Then field two, and the new value for field three, are printed.

In order for this to work, the text in field $2 must make sense as a number; the string of characters must be converted to a number in order for the computer to do arithmetic on it. The number resulting from the subtraction is converted back to a string of characters which then becomes field three. See section Conversion of Strings and Numbers.

When you change the value of a field (as perceived by awk), the text of the input record is recalculated to contain the new field where the old one was. Therefore, $0 changes to reflect the altered field. Thus, this program prints a copy of the input file, with 10 subtracted from the second field of each line.

$ awk '{ $2 = $2 - 10; print $0 }' inventory-shipped
-| Jan 3 25 15 115
-| Feb 5 32 24 226
-| Mar 5 24 34 228
...

You can also assign contents to fields that are out of range. For example:

$ awk '{ $6 = ($5 + $4 + $3 + $2)
>        print $6 }' inventory-shipped
-| 168
-| 297
-| 301
...

We've just created $6, whose value is the sum of fields $2, $3, $4, and $5. The `+' sign represents addition. For the file `inventory-shipped', $6 represents the total number of parcels shipped for a particular month.

Creating a new field changes awk's internal copy of the current input record--the value of $0. Thus, if you do `print $0' after adding a field, the record printed includes the new field, with the appropriate number of field separators between it and the previously existing fields.

This recomputation affects and is affected by NF (the number of fields; see section Examining Fields), and by a feature that has not been discussed yet, the output field separator, OFS, which is used to separate the fields (see section Output Separators). For example, the value of NF is set to the number of the highest field you create.

Note, however, that merely referencing an out-of-range field does not change the value of either $0 or NF. Referencing an out-of-range field only produces an empty string. For example:

if ($(NF+1) != "")
    print "can't happen"
else
    print "everything is normal"

should print `everything is normal', because NF+1 is certain to be out of range. (See section The if-else Statement, for more information about awk's if-else statements. See section Variable Typing and Comparison Expressions, for more information about the `!=' operator.)

It is important to note that making an assignment to an existing field will change the value of $0, but will not change the value of NF, even when you assign the empty string to a field. For example:

$ echo a b c d | awk '{ OFS = ":"; $2 = ""
>                       print $0; print NF }'
-| a::c:d
-| 4

The field is still there; it just has an empty value. You can tell because there are two colons in a row.

This example shows what happens if you create a new field.

$ echo a b c d | awk '{ OFS = ":"; $2 = ""; $6 = "new"
>                       print $0; print NF }'
-| a::c:d::new
-| 6

The intervening field, $5, is created with an empty value (indicated by the second pair of adjacent colons), and NF is updated with the value six.
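
Because any field assignment rebuilds $0 using OFS, assigning a field to itself is one way to rewrite a record with a different separator. The following is a small sketch of that idea, not the only way to do it; the variable name `csv' is chosen for illustration.

```shell
# Assigning $1 to itself changes nothing in the field,
# but forces $0 to be rebuilt with the new OFS.
csv=$(echo 'a b c d' | awk '{ OFS = ","; $1 = $1; print $0 }')
echo "$csv"
```

This prints `a,b,c,d'.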

Specifying How Fields are Separated

This section is rather long; it describes one of the most fundamental operations in awk.

The Basics of Field Separating

The field separator, which is either a single character or a regular expression, controls the way awk splits an input record into fields. awk scans the input record for character sequences that match the separator; the fields themselves are the text between the matches.

In the examples below, we use the bullet symbol "*" to represent spaces in the output.

If the field separator is `oo', then the following line:

moo goo gai pan

would be split into three fields: `m', `*g' and `*gai*pan'. Note the leading spaces in the values of the second and third fields.
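
This split can be reproduced directly. The sketch below sets FS to `oo' (setting FS is described next) and prints each field in brackets so that the leading spaces are visible; the variable name `split_oo' is just an illustration.

```shell
split_oo=$(echo 'moo goo gai pan' |
    awk 'BEGIN { FS = "oo" } { for (i = 1; i <= NF; i++) print i ": [" $i "]" }')
printf '%s\n' "$split_oo"
```

The three lines of output are `1: [m]', `2: [ g]' and `3: [ gai pan]'.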

The field separator is represented by the built-in variable FS. Shell programmers take note! awk does not use the name IFS which is used by the POSIX-compatible shells (such as the Bourne shell, sh, or the GNU Bourne-Again Shell, Bash).

You can change the value of FS in the awk program with the assignment operator, `=' (see section Assignment Expressions). Often the right time to do this is at the beginning of execution, before any input has been processed, so that the very first record will be read with the proper separator. To do this, use the special BEGIN pattern (see section The BEGIN and END Special Patterns). For example, here we set the value of FS to the string ",":

awk 'BEGIN { FS = "," } ; { print $2 }'

Given the input line,

John Q. Smith, 29 Oak St., Walamazoo, MI 42139

this awk program extracts and prints the string `*29*Oak*St.'.

Sometimes your input data will contain separator characters that don't separate fields the way you thought they would. For instance, the person's name in the example we just used might have a title or suffix attached, such as `John Q. Smith, LXIX'. From input containing such a name:

John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139

the above program would extract `*LXIX', instead of `*29*Oak*St.'. If you were expecting the program to print the address, you would be surprised. The moral is: choose your data layout and separator characters carefully to prevent such problems.

Normally, fields are separated by whitespace sequences (spaces and tabs), not by single spaces: two spaces in a row do not delimit an empty field. The default value of the field separator FS is a string containing a single space, " ". If this value were interpreted in the usual way, each space character would separate fields, so two spaces in a row would make an empty field between them. The reason this does not happen is that a single space as the value of FS is a special case: it is taken to specify the default manner of delimiting fields.

If FS is any other single character, such as ",", then each occurrence of that character separates two fields. Two consecutive occurrences delimit an empty field. If the character occurs at the beginning or the end of the line, that too delimits an empty field. The space character is the only single character which does not follow these rules.
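
These rules can be checked by counting fields. In the following sketch the input line is invented: it has a leading comma, two commas in a row, and a trailing comma, so it should split into five fields, three of them empty.

```shell
# Fields are: "", "a", "", "b", "".
count=$(echo ',a,,b,' | awk 'BEGIN { FS = "," } { print NF }')
echo "$count"
```

This prints `5'.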

Using Regular Expressions to Separate Fields

The previous subsection discussed the use of single characters or simple strings as the value of FS. More generally, the value of FS may be a string containing any regular expression. In this case, each match in the record for the regular expression separates fields. For example, the assignment:

FS = ", \t"

makes every area of an input line that consists of a comma followed by a space and a tab, into a field separator. (`\t' is an escape sequence that stands for a tab; see section Escape Sequences, for the complete list of similar escape sequences.)

For a less trivial example of a regular expression, suppose you want single spaces to separate fields the way single commas were used above. You can set FS to "[ ]" (left bracket, space, right bracket). This regular expression matches a single space and nothing else (see section Regular Expressions).
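
The effect of "[ ]" can be compared with the default by counting fields on a line that has two spaces in a row. This is a minimal sketch; the input and the variable names `default_nf' and `single_nf' are made up for the comparison.

```shell
# Default FS: the run of two spaces is one separator.
default_nf=$(echo 'a  b' | awk '{ print NF }')

# FS = "[ ]": each space separates, so there is an empty field in between.
single_nf=$(echo 'a  b' | awk 'BEGIN { FS = "[ ]" } { print NF }')

echo "default: $default_nf, single spaces: $single_nf"
```

This prints `default: 2, single spaces: 3'.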

There is an important difference between the two cases of `FS = " "' (a single space) and `FS = "[ \t]+"' (left bracket, space, backslash, "t", right bracket, plus sign, which is a regular expression matching one or more spaces or tabs). For both values of FS, fields are separated by runs of spaces and/or tabs. However, when the value of FS is " ", awk will first strip leading and trailing whitespace from the record, and then decide where the fields are.

For example, the following pipeline prints `b':

$ echo ' a b c d ' | awk '{ print $2 }'
-| b

However, this pipeline prints `a' (note the extra spaces around each letter):

$ echo ' a  b  c  d ' | awk 'BEGIN { FS = "[ \t]+" }
>                                  { print $2 }'
-| a

In this case, the first field is null, or empty.

The stripping of leading and trailing whitespace also comes into play whenever $0 is recomputed. For instance, study this pipeline:

$ echo '   a b c d' | awk '{ print; $2 = $2; print }'
-|    a b c d
-| a b c d

The first print statement prints the record as it was read, with leading whitespace intact. The assignment to $2 rebuilds $0 by concatenating $1 through $NF together, separated by the value of OFS. Since the leading whitespace was ignored when finding $1, it is not part of the new $0. Finally, the last print statement prints the new $0.

Making Each Character a Separate Field

There are times when you may want to examine each character of a record separately. In gawk, this is easy to do: you simply assign the null string ("") to FS. In this case, each individual character in the record will become a separate field. Here is an example:

echo a b | gawk 'BEGIN { FS = "" }
                 { 
                     for (i = 1; i <= NF; i = i + 1)
                         print "Field", i, "is", $i
                 }'

The output from this is:

Field 1 is a
Field 2 is
Field 3 is b

Traditionally, the behavior for FS equal to "" was not defined. In this case, Unix awk would simply treat the entire record as only having one field (d.c.). In compatibility mode (see section Command Line Options), if FS is the null string, then gawk will also behave this way.
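
Where the null-string behavior cannot be relied on, one possible alternative is to walk the record with the substr and length functions instead of using fields at all. This is only a sketch of that approach, not something the null-string feature requires; the variable name `chars' is an illustration.

```shell
# Visit each character of the record by position.
chars=$(echo 'a b' | awk '{
    n = length($0)
    for (i = 1; i <= n; i++)
        print i ": [" substr($0, i, 1) "]"
}')
printf '%s\n' "$chars"
```

The output is `1: [a]', `2: [ ]' and `3: [b]', one per line.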

Setting FS from the Command Line

FS can be set on the command line. You use the `-F' option to do so. For example:

awk -F, 'program' input-files

sets FS to be the `,' character. Notice that the option uses a capital `F'. Contrast this with `-f', which specifies a file containing an awk program. Case is significant in command line options: the `-F' and `-f' options have nothing to do with each other. You can use both options at the same time to set the FS variable and get an awk program from a file.

The value used for the argument to `-F' is processed in exactly the same way as assignments to the built-in variable FS. This means that if the field separator contains special characters, they must be escaped appropriately. For example, to use a `\' as the field separator, you would have to type:

# same as FS = "\\" 
awk -F\\\\ '...' files ...

Since `\' is used for quoting in the shell, awk will see `-F\\'. Then awk processes the `\\' for escape characters (see section Escape Sequences), finally yielding a single `\' to be used for the field separator.

As a special case, in compatibility mode (see section Command Line Options), if the argument to `-F' is `t', then FS is set to the tab character. This is because if you type `-F\t' at the shell, without any quotes, the `\' gets deleted, so awk figures that you really want your fields to be separated with tabs, and not `t's. Use `-v FS="t"' on the command line if you really do want to separate your fields with `t's (see section Command Line Options).

For example, let's use an awk program file called `baud.awk' that contains the pattern /300/, and the action `print $1'. Here is the program:

/300/   { print $1 }

Let's also set FS to be the `-' character, and run the program on the file `BBS-list'. The following command prints a list of the names of the bulletin boards that operate at 300 baud and the first three digits of their phone numbers:

$ awk -F- -f baud.awk BBS-list
-| aardvark     555
-| alpo
-| barfly       555
...

Note the second line of output. In the original file (see section Data Files for the Examples), the second line looked like this:

alpo-net     555-3412     2400/1200/300     A

The `-' as part of the system's name was used as the field separator, instead of the `-' in the phone number that was originally intended. This demonstrates why you have to be careful in choosing your field and record separators.

On many Unix systems, each user has a separate entry in the system password file, one line per user. The information in these lines is separated by colons. The first field is the user's logon name, and the second is the user's encrypted password. A password file entry might look like this:

arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh

The following program searches the system password file, and prints the entries for users who have no password:

awk -F: '$2 == ""' /etc/passwd
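
To try the same test without reading the real password file, the sketch below feeds awk two entries in the same layout. The `guest' entry is invented for this example and has an empty second field; the program prints only the logon name instead of the whole entry.

```shell
# Hypothetical entries in password-file layout; "guest" has no password field.
sample='arnold:xyzzy:2076:10:Arnold Robbins:/home/arnold:/bin/sh
guest::2077:10:Guest User:/home/guest:/bin/sh'

no_password=$(printf '%s\n' "$sample" | awk -F: '$2 == "" { print $1 }')
echo "$no_password"
```

This prints `guest'.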

Field Splitting Summary

According to the POSIX standard, awk is supposed to behave as if each record is split into fields at the time that it is read. In particular, this means that you can change the value of FS after a record is read, and the value of the fields (i.e. how they were split) should reflect the old value of FS, not the new one.

However, many implementations of awk do not work this way. Instead, they defer splitting the fields until a field is actually referenced. The fields will be split using the current value of FS! (d.c.) This behavior can be difficult to diagnose. The following example illustrates the difference between the two methods. (The sed command prints just the first line of `/etc/passwd'.)

sed 1q /etc/passwd | awk '{ FS = ":" ; print $1 }'

will usually print

root

on an incorrect implementation of awk, while gawk will print something like

root:nSijPlPhZZwgE:0:0:Root:/:
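
One way to avoid depending on when a particular implementation splits the record is to set FS in a BEGIN rule, before the first record is read. The following sketch assumes two invented colon-separated input lines rather than the real `/etc/passwd'.

```shell
# FS is already ":" when the first record is read, so every
# record is split the same way regardless of implementation.
names=$(printf 'root:x:0\ndaemon:x:1\n' | awk 'BEGIN { FS = ":" } { print $1 }')
printf '%s\n' "$names"
```

This prints `root' and `daemon', one per line.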

The following table summarizes how fields are split, based on the value of FS. (`==' means "is equal to.")

FS == " "
Fields are separated by runs of whitespace. Leading and trailing whitespace are ignored. This is the default.
FS == any other single character
Fields are separated by each occurrence of the character. Multiple successive occurrences delimit empty fields, as do leading and trailing occurrences. The character can even be a regexp metacharacter; it does not need to be escaped.
FS == regexp
Fields are separated by occurrences of characters that match regexp. Leading and trailing matches of regexp delimit empty fields.
FS == ""
Each individual character in the record becomes a separate field.

Reading Fixed-width Data

(This section discusses an advanced, experimental feature. If you are a novice awk user, you may wish to skip it on the first reading.)

gawk version 2.13 introduced a new facility for dealing with fixed-width fields with no distinctive field separator. Data of this nature arises, for example, in the input for old FORTRAN programs where numbers are run together; or in the output of programs that did not anticipate the use of their output as input for other programs.

An example of the latter is a table where all the columns are lined up by the use of a variable number of spaces and empty fields are just spaces. Clearly, awk's normal field splitting based on FS will not work well in this case. Although a portable awk program can use a series of substr calls on $0 (see section Built-in Functions for String Manipulation), this is awkward and inefficient for a large number of fields.

The splitting of an input record into fixed-width fields is specified by assigning a string containing space-separated numbers to the built-in variable FIELDWIDTHS. Each number specifies the width of the field including columns between fields. If you want to ignore the columns between fields, you can specify the width as a separate field that is subsequently ignored.

The following data is the output of the Unix w utility. It is useful to illustrate the use of FIELDWIDTHS.

 10:06pm  up 21 days, 14:04,  23 users
User     tty       login  idle   JCPU   PCPU  what
hzuo     ttyV0     8:58pm            9      5  vi p24.tex 
hzang    ttyV3     6:37pm    50                -csh 
eklye    ttyV5     9:53pm            7      1  em thes.tex 
dportein ttyV6     8:17pm  1:47                -csh 
gierd    ttyD3    10:00pm     1                elm 
dave     ttyD4     9:47pm            4      4  w 
brent    ttyp0    26Jun91  4:46  26:46   4:41  bash 
dave     ttyq4    26Jun9115days     46     46  wnewmail

The following program takes the above input, converts the idle time to number of seconds and prints out the first two fields and the calculated idle time. (This program uses a number of awk features that haven't been introduced yet.)

BEGIN  { FIELDWIDTHS = "9 6 10 6 7 7 35" }
NR > 2 {
    idle = $4
    sub(/^  */, "", idle)   # strip leading spaces
    if (idle == "")
        idle = 0
    if (idle ~ /:/) {
        split(idle, t, ":")
        idle = t[1] * 60 + t[2]
    }
    if (idle ~ /days/)
        idle *= 24 * 60 * 60
 
    print $1, $2, idle
}

Here is the result of running the program on the data:

hzuo      ttyV0  0
hzang     ttyV3  50
eklye     ttyV5  0
dportein  ttyV6  107
gierd     ttyD3  1
dave      ttyD4  0
brent     ttyp0  286
dave      ttyq4  1296000

Another (possibly more practical) example of fixed-width input data would be the input from a deck of balloting cards. In some parts of the United States, voters mark their choices by punching holes in computer cards. These cards are then processed to count the votes for any particular candidate or on any particular issue. Since a voter may choose not to vote on some issue, any column on the card may be empty. An awk program for processing such data could use the FIELDWIDTHS feature to simplify reading the data. (Of course, getting gawk to run on a system with card readers is another story!)

Assigning a value to FS causes gawk to return to using FS for field splitting. Use `FS = FS' to make this happen, without having to know the current value of FS.

This feature is still experimental, and may evolve over time. Note that in particular, gawk does not attempt to verify the sanity of the values used in the value of FIELDWIDTHS.
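
For comparison, the portable approach mentioned earlier, a series of substr calls on $0, can be sketched for a single row of the w output shown above. The column positions are an assumption derived from the widths "9 6 10 6" used in the FIELDWIDTHS example, and the shell variable names `row' and `result' are chosen for illustration.

```shell
# One row copied from the sample w output above.
row='dportein ttyV6     8:17pm  1:47                -csh'

result=$(printf '%s\n' "$row" | awk '{
    user = substr($0, 1, 9)     # columns 1-9
    tty  = substr($0, 10, 6)    # columns 10-15
    idle = substr($0, 26, 6)    # columns 26-31; login time (16-25) is skipped
    gsub(/ /, "", user)         # strip the padding spaces
    gsub(/ /, "", tty)
    gsub(/ /, "", idle)
    if (idle ~ /:/) {           # same conversion as the program above
        split(idle, t, ":")
        idle = t[1] * 60 + t[2]
    }
    print user, tty, idle
}')
echo "$result"
```

This prints `dportein ttyV6 107', the same idle value the FIELDWIDTHS program computes for that row, but each additional column needs its own substr call.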

Multiple-Line Records

In some databases, a single line cannot conveniently hold all the information in one entry. In such cases, you can use multi-line records.

The first step in doing this is to choose your data format: when records are not defined as single lines, how do you want to define them? What should separate records?

One technique is to use an unusual character or string to separate records. For example, you could use the formfeed character (written `\f' in awk, as in C) to separate them, making each record a page of the file. To do this, just set the variable RS to "\f" (a string containing the formfeed character). Any other character could equally well be used, as long as it won't be part of the data in a record.
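
As a small sketch of the formfeed technique, the following pipeline builds a two-page input with printf and numbers each page. The page contents and the variable name `pages' are made up for the illustration.

```shell
# Two "pages" separated by a formfeed character.
pages=$(printf 'page one\fpage two' |
    awk 'BEGIN { RS = "\f" } { print NR ": " $0 }')
printf '%s\n' "$pages"
```

This prints `1: page one' and `2: page two', one per line.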

Another technique is to have blank lines separate records. By a special dispensation, an empty string as the value of RS indicates that records are separated by one or more blank lines. If you set RS to the empty string, a record always ends at the first blank line encountered. And the next record doesn't start until the first non-blank line that follows--no matter how many blank lines appear in a row, they are considered one record-separator.

You can achieve the same effect as `RS = ""' by assigning the string "\n\n+" to RS. This regexp matches the newline at the end of the record, and one or more blank lines after the record. In addition, a regular expression always matches the longest possible sequence when there is a choice (see section How Much Text Matches?). So the next record doesn't start until the first non-blank line that follows--no matter how many blank lines appear in a row, they are considered one record-separator.

There is an important difference between `RS = ""' and `RS = "\n\n+"'. In the first case, leading newlines in the input data file are ignored, and if a file ends without extra blank lines after the last record, the final newline is removed from the record. In the second case, this special processing is not done (d.c.).

Now that the input is separated into records, the second step is to separate the fields in the record. One way to do this is to divide each of the lines into fields in the normal manner. This happens by default as the result of a special feature: when RS is set to the empty string, the newline character always acts as a field separator. This is in addition to whatever field separations result from FS.

The original motivation for this special exception was probably to provide useful behavior in the default case (i.e. FS is equal to " "). This feature can be a problem if you really don't want the newline character to separate fields, since there is no way to prevent it. However, you can work around this by using the split function to break up the record manually (see section Built-in Functions for String Manipulation).
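
A minimal sketch of the split workaround, assuming a made-up record whose parts are separated by colons and which happens to span two lines. The array name parts and the shell variable part_count are illustrative only.

```shell
# One hypothetical record spanning two lines, with ":" between the parts.
part_count=$(printf 'a:b\nc:d\n' |
    awk 'BEGIN { RS = ""; FS = ":" }
         {
             # Split the whole record on ":" only; the newline stays
             # inside the second piece instead of ending a field.
             n = split($0, parts, ":")
             print n
         }')
echo "$part_count"
```

Here split sees only the separator passed to it, so it should produce three pieces, with the newline kept inside the second one.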

Another way to separate fields is to put each field on a separate line: to do this, just set the variable FS to the string "\n". (This simple regular expression matches a single newline.)

A practical example of a data file organized this way might be a mailing list, where each entry is separated by blank lines. If we have a mailing list in a file named `addresses', that looks like this:

Jane Doe
123 Main Street
Anywhere, SE 12345-6789

John Smith
456 Tree-lined Avenue
Smallville, MW 98765-4321

...

A simple program to process this file would look like this:

# addrs.awk -- simple mailing list program

# Records are separated by blank lines.
# Each line is one field.
BEGIN { RS = "" ; FS = "\n" }

{
      print "Name is:", $1
      print "Address is:", $2
      print "City and State are:", $3
      print ""
}

Running the program produces the following output:

$ awk -f addrs.awk addresses
-| Name is: Jane Doe
-| Address is: 123 Main Street
-| City and State are: Anywhere, SE 12345-6789
-| 
-| Name is: John Smith
-| Address is: 456 Tree-lined Avenue
-| City and State are: Smallville, MW 98765-4321
-| 
...

See section Printing Mailing Labels, for a more realistic program that deals with address lists.

The following table summarizes how records are split, based on the value of RS. (`==' means "is equal to.")

RS == "\n"
Records are separated by the newline character (`\n'). In effect, every line in the data file is a separate record, including blank lines. This is the default.
RS == any single character
Records are separated by each occurrence of the character. Multiple successive occurrences delimit empty records.
RS == ""
Records are separated by runs of blank lines. The newline character always serves as a field separator, in addition to whatever value FS may have. Leading and trailing newlines in a file are ignored.
RS == regexp
Records are separated by occurrences of characters that match regexp. Leading and trailing matches of regexp delimit empty records.

In all cases, gawk sets RT to the input text that matched the value specified by RS.
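
The single-character case from the table can be tried with a short sketch like this one. The input string is invented, and has no trailing newline so that the last record is exactly `b'.

```shell
# "a//b": the two adjacent slashes enclose an empty record.
records=$(printf 'a//b' |
    awk 'BEGIN { RS = "/" }
         { print NR ": [" $0 "]" }')
echo "$records"
```

Three records should be printed, the second of them empty.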

Explicit Input with getline

So far we have been getting our input data from awk's main input stream--either the standard input (usually your terminal, sometimes the output from another program) or from the files specified on the command line. The awk language has a special built-in command called getline that can be used to read input under your explicit control.

Introduction to getline

This command is used in several different ways, and should not be used by beginners. It is covered here because this is the chapter on input. The examples that follow the explanation of the getline command include material that has not been covered yet. Therefore, come back and study the getline command after you have reviewed the rest of this book and have a good knowledge of how awk works.

getline returns one if it finds a record, and zero if the end of the file is encountered. If there is some error in getting a record, such as a file that cannot be opened, then getline returns -1. In this case, gawk sets the variable ERRNO to a string describing the error that occurred.
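
The three return values can be observed with a small sketch such as the following. It assumes a temporary file holding a single line, plus a second file name that does not exist; both names are created only for this illustration.

```shell
tmpfile=$(mktemp)                        # hypothetical one-line input file
printf 'only line\n' > "$tmpfile"
getline_results=$(awk -v good="$tmpfile" -v bad="$tmpfile.does-not-exist" 'BEGIN {
    first  = (getline line < good)       # a record is found
    second = (getline line < good)       # end of file
    third  = (getline line < bad)        # the file cannot be opened
    print first, second, third
}')
rm -f "$tmpfile"
echo "$getline_results"
```

The parentheses around each getline expression keep the comparison-like `<' from being misread. Because an error yields -1, a loop test should be written as `(getline ...) > 0' rather than as a bare truth test, which would loop forever on an unreadable file.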

In the following examples, command stands for a string value that represents a shell command.

Using getline with No Arguments

The getline command can be used without arguments to read input from the current input file. All it does in this case is read the next input record and split it up into fields. This is useful if you've finished processing the current record, but you want to do some special processing right now on the next record. Here's an example:

awk '{
     if ((t = index($0, "/*")) != 0) {
          # value will be "" if t is 1
          tmp = substr($0, 1, t - 1)
          u = index(substr($0, t + 2), "*/")
          while (u == 0) {
               if (getline <= 0) {
                    m = "unexpected EOF or error"
                    m = (m ": " ERRNO)
                    print m > "/dev/stderr"
                    exit
               }
               t = -1
               u = index($0, "*/")
          }
          # substr expression will be "" if */
          # occurred at end of line
          $0 = tmp substr($0, t + u + 3)
     }
     print $0
}'

This awk program deletes all C-style comments, `/* ... */', from the input. By replacing the `print $0' with other statements, you could perform more complicated processing on the decommented input, like searching for matches of a regular expression. This program has a subtle problem--it does not work if one comment ends and another begins on the same line.

This form of the getline command sets NF (the number of fields; see section Examining Fields), NR (the number of records read so far; see section How Input is Split into Records), FNR (the number of records read from this input file), and the value of $0.

Note: the new value of $0 is used in testing the patterns of any subsequent rules. The original value of $0 that triggered the rule which executed getline is lost (d.c.). By contrast, the next statement reads a new record but immediately begins processing it normally, starting with the first rule in the program. See section The next Statement.
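
The effect on later rules can be sketched with two rules and three invented input lines. The first rule calls getline when it sees the header line; the second rule is then tested against the record that getline read.

```shell
result=$(printf 'HEADER\nfirst\nsecond\n' |
    awk '/^HEADER/ { getline }             # replaces $0 with the next record
         /^first/  { print NR ": " $0 }')  # tested against the new $0
echo "$result"
```

The second rule should fire once, with NR already advanced to 2, while the header line itself is never seen by it.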

Using getline Into a Variable

You can use `getline var' to read the next record from awk's input into the variable var. No other processing is done.

For example, suppose the next line is a comment, or a special string, and you want to read it, without triggering any rules. This form of getline allows you to read that line and store it in a variable so that the main read-a-line-and-check-each-rule loop of awk never sees it.

The following example swaps every two lines of input. For example, given:

wan
tew
free
phore

it outputs:

tew
wan
phore
free

Here's the program:

awk '{
     if ((getline tmp) > 0) {
          print tmp
          print $0
     } else
          print $0
}'

The getline command used in this way sets only the variables NR and FNR (and of course, var). The record is not split into fields, so the values of the fields (including $0) and the value of NF do not change.
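
A short sketch of which variables change, using two invented input lines. The variable name following is arbitrary.

```shell
result=$(printf 'a b c\nd e\n' |
    awk 'NR == 1 {
             getline following          # reads "d e" into the variable only
             print NR, NF, $0, "|", following
         }')
echo "$result"
```

NR should now be 2, while NF and $0 still describe the first line.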

Using getline from a File

Use `getline < file' to read the next record from the file file. Here file is a string-valued expression that specifies the file name. `< file' is called a redirection since it directs input to come from a different place.

For example, the following program reads its input record from the file `secondary.input' when it encounters a first field with a value equal to 10 in the current input file.

awk '{
    if ($1 == 10) {
         getline < "secondary.input"
         print
    } else
         print
}'

Since the main input stream is not used, the values of NR and FNR are not changed. But the record read is split into fields in the normal manner, so the values of $0 and other fields are changed. So is the value of NF.
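
The following sketch records NR and NF before and after reading from a secondary file. The temporary file and its contents are invented for the illustration.

```shell
side_file=$(mktemp)                      # hypothetical secondary input
printf 'w x y z\n' > "$side_file"
result=$(printf 'a b\n' |
    awk -v side="$side_file" '{
             before = NR ":" NF
             getline < side              # replaces $0 and NF, leaves NR alone
             print before, NR ":" NF, $0
         }')
rm -f "$side_file"
echo "$result"
```

NR should stay at 1 while NF goes from 2 to 4 and $0 becomes the line read from the secondary file.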

Using getline Into a Variable from a File

Use `getline var < file' to read input from the file file and put it in the variable var. As above, file is a string-valued expression that specifies the file from which to read.

In this version of getline, none of the built-in variables are changed, and the record is not split into fields. The only variable changed is var.

For example, the following program copies all the input files to the output, except for records that say `@include filename'. Such a record is replaced by the contents of the file filename.

awk '{
     if (NF == 2 && $1 == "@include") {
          while ((getline line < $2) > 0)
               print line
          close($2)
     } else
          print
}'

Note here how the name of the extra input file is not built into the program; it is taken directly from the data, from the second field on the `@include' line.

The close function is called to ensure that if two identical `@include' lines appear in the input, the entire specified file is included twice. See section Closing Input and Output Files and Pipes.
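
What close buys you can be seen in a small sketch that reads the same two-line file three times, calling close only before the third pass. The temporary file and the counter names are made up for the example.

```shell
tmpfile=$(mktemp)                        # hypothetical two-line file
printf 'one\ntwo\n' > "$tmpfile"
counts=$(awk -v f="$tmpfile" 'BEGIN {
    while ((getline line < f) > 0) first++
    while ((getline line < f) > 0) second++   # still at end of file
    close(f)                                  # next read reopens the file
    while ((getline line < f) > 0) third++
    print first, second + 0, third
}')
rm -f "$tmpfile"
echo "$counts"
```

The second pass should read nothing, because the file is still open and positioned at its end. After close, the third pass starts again from the first line.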

One deficiency of this program is that it does not process nested `@include' statements (`@include' statements in included files) the way a true macro preprocessor would. See section An Easy Way to Use Library Functions, for a program that does handle nested `@include' statements.

Using getline from a Pipe

You can pipe the output of a command into getline, using `command | getline'. In this case, the string command is run as a shell command and its output is piped into awk to be used as input. This form of getline reads one record at a time from the pipe.

For example, the following program copies its input to its output, except for lines that begin with `@execute', which are replaced by the output produced by running the rest of the line as a shell command:

awk '{
     if ($1 == "@execute") {
          tmp = substr($0, 10)
          while ((tmp | getline) > 0)
               print
          close(tmp)
     } else
          print
}'

The close function is called to ensure that if two identical `@execute' lines appear in the input, the command is run for each one. See section Closing Input and Output Files and Pipes.

Given the input:

foo
bar
baz
@execute who
bletch

the program might produce:

foo
bar
baz
arnold     ttyv0   Jul 13 14:22
miriam     ttyp0   Jul 13 14:23     (murphy:0)
bill       ttyp1   Jul 13 14:23     (murphy:0)
bletch

Notice that this program ran the command who and printed the result. (If you try this program yourself, you will of course get different results, showing you who is logged in on your system.)

This variation of getline splits the record into fields, sets the value of NF and recomputes the value of $0. The values of NR and FNR are not changed.
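
A minimal sketch of these effects, run from a BEGIN rule so that no main input has been read. The echo command stands in for any shell command.

```shell
result=$(awk 'BEGIN {
    cmd = "echo one two three"           # hypothetical command
    cmd | getline                        # sets $0 and NF from the pipe
    close(cmd)
    print NR, NF, $2
}')
echo "$result"
```

NR should still be 0, while NF and the fields reflect the line read from the command.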

Using getline Into a Variable from a Pipe

When you use `command | getline var', the output of the command command is sent through a pipe to getline and into the variable var. For example, the following program reads the current date and time into the variable current_time, using the date utility, and then prints it.

awk 'BEGIN {
     "date" | getline current_time
     close("date")
     print "Report printed on " current_time
}'

In this version of getline, none of the built-in variables are changed, and the record is not split into fields.
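
The date example reads a single line. When a command may print several lines, the same form can be used in a loop, as in this sketch. The command string is invented and simply prints three lines.

```shell
result=$(awk 'BEGIN {
    cmd = "echo alpha; echo beta; echo gamma"   # hypothetical command
    while ((cmd | getline line) > 0)            # stop at end of output or error
        count++
    close(cmd)
    print count, line
}')
echo "$result"
```

When the loop ends, line still holds the last record that was read, since getline does not change the variable at end of file.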

Summary of getline Variants

With all the forms of getline, even though $0 and NF may be updated, the record will not be tested against all the patterns in the awk program, in the way that would happen if the record were read normally by the main processing loop of awk. However, the new record is tested against any subsequent rules.

Many awk implementations limit the number of pipelines an awk program may have open to just one! In gawk, there is no such limit. You can open as many pipelines as the underlying operating system will permit.

The following table summarizes the six variants of getline, listing which built-in variables are set by each one.

getline
sets $0, NF, FNR, and NR.
getline var
sets var, FNR, and NR.
getline < file
sets $0 and NF.
getline var < file
sets var.
command | getline
sets $0 and NF.
command | getline var
sets var.

