AWK - quick reference

awk is an extremely versatile pattern-matching and text-processing language. Although "awk scripts" can be written directly on the command line, more sophisticated use involves script files. Where the shell language is geared to running various utilities, the awk language is geared to extracting information from files according to rules and patterns.
As a programming language (which looks much like C in some respects), awk has the usual selection, loop, and condition constructs, and has variables. Awk variables are not referenced with a leading $ as in the shell, except for the variables representing fields in the input line: $1, $2, ... . See below. As with the shell, comments come after the # character.
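On most Unix systems an awk script file can be made executable by giving it a first line such as #!/usr/bin/awk -f (the path to awk varies by system); an executable shell script with a here-document for awk is another option. The usual way to invoke an awk script is similar to the first method for shell scripts: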
awk -f script-file input-files
If input-files is missing, awk reads its input from stdin. Output goes to stdout, so awk can be used as a "filter" in a pipe.
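A minimal sketch of both invocation styles. The temporary script file, the function names, and the sample data are made up for this illustration.

```shell
# Hypothetical script file holding a one-rule program:
# print the first field of every line.
script=$(mktemp)
printf '%s\n' '{ print $1 }' > "$script"

# Style 1: program read from a script file, input arriving on stdin.
first_fields_from_file() {
    awk -f "$script"
}

# Style 2: the same program given inline, used as a filter in a pipe.
first_fields_inline() {
    awk '{ print $1 }'
}

printf 'alpha 1\nbeta 2\n' | first_fields_from_file
printf 'alpha 1\nbeta 2\n' | first_fields_inline | sort -r
```

Because awk reads stdin and writes stdout here, it slots between printf and sort like any other filter.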
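The general form of an awk script is
pattern { action }
pattern { action }
.
.
The opening brace of each action must be on the same line as its pattern; the action itself may then run over several lines until the closing brace. (If a line is too long, the trailing \ trick can be used.)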
For each input line, awk checks to see if it matches a pattern. If it does, the corresponding action is taken. Either pattern or action is optional. If pattern is absent (i.e. a script line is written: {action}) then the action applies to every line; if the action (and the {}) is missing the default is to print the matching line. If more than one pattern/action combination matches a line, they are done in order.
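The three rule shapes described above can be tried with a small sketch like this one. The function name and the input lines are invented for the example.

```shell
# Three rules: pattern only, action only, and pattern with action.
demo_rules() {
    awk '
    /b/                                  # pattern only: print lines containing "b"
    { n++ }                              # action only: runs on every line
    /c/ { print "c seen on line " n }    # checked after the two rules above
    '
}

printf 'abc\nxyz\ncab\n' | demo_rules
```

The first and third lines match both /b/ and /c/, so each is printed and then reported, in the order the rules appear in the script.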
PATTERNS
Awk patterns are regular expressions, which resemble the shell's wildcard patterns but use the richer syntax of egrep. Awk patterns can also be more complicated, and can involve Boolean combinations and relational expressions.
Regular expressions must be surrounded by slashes (/); a literal slash is obtained by quoting it with the backslash: \/. For a description of regular expressions see the man page for ed or egrep (awk uses the extended, egrep-style syntax, and not all syntaxes are exactly identical, so be wary!)
The boolean operators are ! (not), || (or), and && (and), plus parentheses for grouping (just like in C). A relational expression involves the relational operators of C: >, >=, ==, <, <=, !=; plus two matching operators, ~ (matches) and !~ (does not match).
Example patterns are:
length < 72           # lines shorter than 72 characters (length with no argument is the length of $0)
/fred/ || /jane/      # lines containing either 'fred' or 'jane'
/fred/ && /ethel/     # lines containing both 'fred' and 'ethel'
$1 > 10 || $2 < 20    # lines where field 1 is greater than 10, or
                      # field 2 is less than 20
$5 ~ /\.c/            # lines where field 5 contains .c (the dot is escaped; a bare . matches any character)
NR == 5               # line 5 of the input (NR is the record number)
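A few of these pattern forms can be tried against made-up data, as in the sketch below. The sample() helper and its five-field records are assumptions for illustration only.

```shell
# Hypothetical five-field sample records.
sample() {
    printf '%s\n' 'fred 5 30 x main.c' 'jane 20 10 x notes.txt' 'ethel 50 50 x abc'
}

sample | awk '/fred/ || /jane/ { print $1 }'      # either name
sample | awk '$2 > 10 && $3 < 20 { print $1 }'    # both relations must hold
sample | awk '$5 ~ /\.c/ { print $5 }'            # literal ".c" in field 5
sample | awk '$5 ~ /.c/ { print $5 }'             # any character followed by "c"
```

The last two commands differ only in the escaped dot: the unescaped form also selects abc, because . matches the b.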
The special patterns BEGIN and END match, respectively, before the first input line is read and after the last one has been processed. Early versions of awk required them to appear first and last among the patterns; POSIX awk allows them anywhere, but putting them first and last remains the convention. They are used to cause actions at the start and end of processing. For example, to change the field separator (which controls how awk splits up the input line among $1, $2, ...) try
BEGIN { FS=":" } # make the field sep. a colon, useful for parsing password files
This can also be set on the command line with the awk -F flag.
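Both ways of setting the separator can be compared with a sketch like the following. The passwd-style records are invented, and passwd_sample is a helper name chosen for this example.

```shell
# Made-up records in passwd layout: name:password:uid:gid:comment:home:shell
passwd_sample() {
    printf '%s\n' 'root:x:0:0:root:/root:/bin/sh' 'ann:x:1000:1000:Ann:/home/ann:/bin/sh'
}

# Set the separator in a BEGIN rule...
passwd_sample | awk 'BEGIN { FS = ":" } { print $1, $6 }'
# ...or with -F on the command line; both print the same thing.
passwd_sample | awk -F: '{ print $1, $6 }'
```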
A pattern may consist of two patterns separated by a comma; the action is performed for all lines from an occurrence of the first pattern through the next occurrence of the second. For example,
NR == 10, NR == 20 # lines 10 through 20 inclusive
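A short sketch of range patterns on made-up input, using smaller line numbers than the example above so the output stays readable. The function name lines_2_to_4 is hypothetical.

```shell
# Keep only lines 2 through 4 (no action, so matching lines are printed).
lines_2_to_4() {
    awk 'NR == 2, NR == 4'
}
printf '%s\n' one two three four five six | lines_2_to_4

# A range can also be bounded by regular expressions; both ends are included.
printf '%s\n' a START b c END d | awk '/START/, /END/'
```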
Variables
awk variables are used like those in the shell: they do not need to be declared, and they are treated as strings or numbers as appropriate. (Numbers are floating point, so 7 / 2 is 3.5.) Variables are initialized to the null string; when used as a number this is equivalent to 0.
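The string/number behaviour can be seen in a small sketch. The names var_demo, unset_var, and x are made up for the example.

```shell
var_demo() {
    awk 'BEGIN {
        print "[" unset_var "]"    # never assigned: the null string
        print unset_var + 1        # the same variable used as a number: 0 + 1
        x = "3"
        print x + 4                # numeric context: addition
        print x "4"                # string context: concatenation
    }'
}
var_demo
```

The same variable x yields 7 or 34 depending only on the operator applied to it.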
awk defines several program variables:
$0           entire input line
FS           character used to separate fields, initially a blank (which splits on any run of blanks or tabs)
$1, $2, ...  individual fields in the current line
NF           number of fields in current line; the last field is $NF
NR           record number of current line
FILENAME     name of current file
length       length of current line (strictly a built-in function; with no argument it means length($0))
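A minimal sketch that reports some of these variables for each input line. The function name describe_lines and the input are invented here.

```shell
describe_lines() {
    awk '{ print NR ": " NF " fields, last is " $NF ", length " length }'
}
printf 'a b c\nd e\n' | describe_lines
```

Note that $NF is the field whose number is NF, so it always names the last field however many there are.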
For numeric variables, the usual C operators can be used: +, -, *, /, %, ++, --, +=, -=, *=, /=, and %=. Note that / is not integer division; use int() to truncate the result.
There are several built-in numeric and string functions, such as exp, log, sqrt, and int (which truncates its argument to an integer). substr(s, m, n) returns the n-character substring of s that begins at position m (positions count from 1). The function length() returns the length of its argument. (Note that function calls in awk look like those in C, with parentheses, except for the print statements.)
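A quick sketch of a few of these functions, run from a BEGIN rule so no input is needed. The name builtin_demo is made up for the example.

```shell
builtin_demo() {
    awk 'BEGIN {
        print substr("hello world", 7, 5)   # 5 characters starting at position 7
        print length("hello")               # number of characters
        print int(7 / 2)                    # 7 / 2 is 3.5; int() truncates
        print sqrt(16)
    }'
}
builtin_demo
```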
Arrays
Variables can also be arrays (denoted x[i]), which are automatically created and extended with use. Array subscripts can be any string (as well as integer-like expressions), which allows for an "associative" memory. For example, colour["red"]=1 (string constants are enclosed in quotes). This can be used to count, for example, how many red, green, and blue marbles are noted in a file.
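The marble count might look like the sketch below, assuming one colour per input line. The function name and the input are invented, and the trailing sort is there only because awk returns subscripts in no fixed order.

```shell
count_colours() {
    awk '
    { colour[$1]++ }                  # one counter per distinct word
    END {
        for (c in colour)             # subscripts come back in no fixed order
            print c, colour[c]
    }' | sort                         # sort only to make the output stable
}
printf '%s\n' red green red blue red green | count_colours
```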
A line of text can be split into the elements of an array using 'split':
n = split(s, arr, sep)
where n is the number of array elements created (arr[1] through arr[n]), s is the string to be split, arr is the array that receives the pieces, and sep is an optional separator (the default is the current value of FS). For example,
split($0, ent, ":")
splits the current line into the array 'ent' using the colon as the separator (this is useful for dealing with password files).
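A small sketch of split on a made-up passwd-style line. The function name split_demo is hypothetical; ent follows the example above.

```shell
split_demo() {
    awk '{
        n = split($0, ent, ":")       # pieces land in ent[1] .. ent[n]
        print n " pieces; login=" ent[1] " shell=" ent[n]
    }'
}
printf '%s\n' 'ann:x:1000:1000:Ann:/home/ann:/bin/sh' | split_demo
```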
ACTIONS
An action is a sequence of statements, each terminated by a semicolon, a newline, or a right brace (}). An awk action (or statement) looks very much like a block in C. The statements can be any of the following:
if ( conditional ) statement [ else statement ]        # [] denote optional
while ( conditional ) statement
for ( expression ; conditional ; expression ) statement
break
continue
{ [ statement ] ... }
variable = expression
print [ expression-list ]
printf format [ , expression-list ]
next                                                   # skip remaining patterns on this input line
exit                                                   # skip the rest of the input (any END action still runs)
Observe that the syntax of these statements is essentially that of C. 'statement' can be a brace-enclosed list of statements. The conditionals are constructed in the usual fashion using the relational operators above.
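Several of these statements can be combined in a sketch like this one, which classifies one number per input line. The function name classify and the output wording are made up for the example.

```shell
classify() {
    awk '
    {
        if ($1 < 0) { print $1, "is negative"; next }   # next skips the rule below
        else if ($1 == 0) print $1, "is zero"
        else print $1, "is positive"
    }
    # Reached only for lines that did not execute next.
    $1 > 0 { bar = ""; for (i = 0; i < $1; i++) bar = bar "*"; print bar }
    '
}
printf '%s\n' 3 0 -2 | classify
```

Only the line holding 3 reaches the second rule with a true pattern, so only it gets a bar of stars.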
The 'for' loop has an extended syntax for use with associative arrays:
for (elem in array) statement
where elem is a variable that is assigned each subscript of the array in succession (the order is likely to be indeterminate); the corresponding value is array[elem].
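A minimal sketch of the loop over a hand-built array. The array price and its contents are invented, and the output is sorted only because the traversal order is not fixed.

```shell
price_list() {
    awk 'BEGIN {
        price["apple"] = 3
        price["pear"] = 5
        for (item in price)                   # item takes each subscript in turn
            print item " costs " price[item]  # the value is looked up by subscript
    }' | sort
}
price_list
```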
Print
The print statements ('print' and 'printf') provide explicit control of output. Each statement takes a list of expressions or variables and prints the value of each. 'print' provides a general output mechanism, while 'printf' requires a format string of the same type as the C printf() function and allows the user to control the output format precisely. (Note that parentheses are not required with the awk print statements.)
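The difference between the two statements shows up in a sketch like this. The function name print_demo and the input line are made up.

```shell
print_demo() {
    awk '{
        print $1, $2                      # comma: values joined by a space, newline added
        print $1 $2                       # no comma: the values are concatenated
        printf "%-6s|%5.2f\n", $1, $2     # explicit widths; the newline must be given
    }'
}
printf 'pi 3.14159\n' | print_demo
```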
AWK EXAMPLES
awk scripts can range from the trivial to the horrendously complex. Some examples are below.

# print fields backward for each line
{ for (i = NF; i > 0; --i) printf "%s ", $i; printf "\n"}


# compute the average of the numbers in column 3
{ s += $3 }
END { if (NR > 0) printf "average value = %f\n", s / NR }   # guard against empty input


# print all lines whose first field is different from the previous line's
$1 != prev { print; prev = $1 }


# look for pairs of identical words
FILENAME != prevfile {      # new file
    NR = 1                  # reset line number
    prevfile = FILENAME     # save this file's name
}
NF > 0 {                    # work on non-empty lines
    if ($1 == lastword)
        printf "double %s, file %s, line %d\n", $1, FILENAME, NR
    for (i = 2; i <= NF; i++)
        if ($i == $(i-1))   # is field 'i' same as previous?
            printf "double %s, file %s, line %d\n", $i, FILENAME, NR
    lastword = $NF          # save last word to compare with first
                            # on next line
}
