Introduction
The gawk command is the GNU version of awk. Gawk is a powerful text-processing and data-manipulating tool with many features and practical uses.
This guide will teach you how to use the Linux gawk command with examples.
Prerequisites
- A system running Linux.
- Access to the terminal.
- A text file. This tutorial uses the file people as an example.
gawk Linux Command Syntax
The basic gawk syntax looks like this:
gawk [options] [actions/filters] input_file
The command cannot be run without any arguments. The options are not mandatory, but for gawk to produce output, at least one action or filter must be specified. Actions and filters are the operations and selection criteria that enable gawk to manipulate data from the input file.
Note: Encase the actions and filters (the program text) in single quotes so that the shell does not interpret characters such as $. Options are written outside the quotes.
gawk Options
The gawk command is a versatile tool thanks to its numerous options. Because gawk is the GNU implementation of awk, long, GNU-style options are available. Each long option has a corresponding short one.
Common options are presented below:
Option | Description |
---|---|
-f program-file, --file program-file | Reads the program from a file, which serves as a script, instead of from the first argument in the terminal. |
-F fs, --field-separator fs | Uses fs as the input field separator (the value of the predefined variable FS). |
-v var=val, --assign var=val | Assigns a value to the variable before executing a script. |
-b, --characters-as-bytes | Treats all data as single-byte characters. |
-c, --traditional | Executes gawk in compatibility mode. |
-C, --copyright | Displays the GNU Copyright message. |
-d[file], --dump-variables[=file] | Writes a list of global variables, their types, and values to a file. |
-e program-text, --source program-text | Provides program source code on the command line, which allows the mixing of library functions and source code. |
-E file, --exec file | Works like -f, but is the last option processed and turns off terminal variable assignments. |
-L[value], --lint[=value] | Prints warning messages about dubious constructs or code not portable to other AWK implementations. |
-S, --sandbox | Runs gawk in sandbox mode. |
gawk Built-in Variables
The gawk command offers several built-in variables that hold information about the input and control how it is processed. Some are set by gawk while it runs (for example, NR and NF), while others change the program's behavior only when a user assigns a value to them (for example, FS and OFS). Some important gawk built-in variables are:
Variable | Description |
---|---|
ARGC | Holds the number of command-line arguments. |
ARGIND | Holds the index in ARGV of the file currently being processed. |
ARGV | Holds an array of command-line arguments. |
ERRNO | Contains a string describing a system error. |
FIELDWIDTHS | Holds a white-space separated list of field widths for fixed-width input. |
FILENAME | Holds the name of the current input file. |
FNR | Holds the input record number in the current file. |
FS | Represents the input field separator. |
IGNORECASE | Turns case-sensitive matching on or off. |
NF | Holds the number of fields in the current input record. |
NR | Holds the total number of input records read so far. |
OFS | Represents the output field separator. |
ORS | Represents the output record separator. |
RS | Represents the input record separator. |
RSTART | Represents the index of the first matched character (set by the match() function). |
RLENGTH | Represents the matched string length (set by the match() function). |
gawk Examples
The pattern-matching and text-processing features of gawk are extensive. This article provides practical examples through which users learn to use the gawk utility.
Important: The gawk command is case-sensitive. Use the IGNORECASE variable to ignore case.
Print Files
By default, gawk with a print action displays every line from the specified file. For instance, running the cat command on the people text file prints its entire contents.
The gawk command displays the same result:
gawk '{print}' people
Print a Column
In text files, spaces are usually used as delimiters for columns. The people file consists of four columns:
- Ordinal numbers.
- First names.
- Last names.
- Year of birth.
Use gawk to show only a specific column in the terminal. For instance:
gawk '{print $2}' people
The command prints only the second column. To print multiple columns, like column one (ordinal numbers) and column two (first names), run:
gawk '{print $1, $2}' people
The gawk command also works without the comma between $1 and $2. However, there are no spaces between columns in the output:
gawk '{print $1 $2}' people
Filter Columns
The gawk command offers additional filtering options. For instance, print lines containing the capital letter O with:
gawk '/O/ {print}' people
To show only lines containing the letters O or A, use the regular expression alternation operator (|):
gawk '/O|A/ {print}' people
The command prints any line that includes a capital O or A. On the other hand, use logical AND (&&) to show lines including both O and the year 1995:
gawk '/O/ && /1995/' people
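The filters work with numbers as well. For example, show only people born in the 1990s with:
gawk '/199[0-9]/ {print}' people
The pattern 199[0-9] matches 199 followed by any digit. (A pattern such as 199* would instead match 19 followed by zero or more 9s, so it would also match years such as 1985.) The output shows only lines in which the fourth column includes the value 199.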
Customize the output even more by combining previously mentioned options. For example, print only the first and last names of people born in 1995 or 2003 with:
gawk '/1995|2003/ {print $2, $3}' people
The command prints columns two and three as stated in the {print $2, $3} part. The output only shows lines containing the number 1995 or 2003, even though the column containing those numbers is hidden.
The gawk command also lets users print everything except for the lines containing the specified string with the logical NOT (!). For instance, omit lines containing the string 19 in the output:
gawk '!/19/' people
Add Line Numbers
The people file includes line numbers in the first column. In case users are working on a file without line numbers, gawk presents options to add them.
For instance, the humans file doesn't include any ordinal numbers:
To add line numbers, execute gawk with FNR and next:
gawk '{ print FNR, $0; next}' humans
The command adds a line number before each line. The same result is achieved with the NR variable:
gawk '{print NR, $0}' humans
Find Line Count
To count the total number of lines in the file, use the END statement and the NR variable with gawk:
gawk 'END { print NR }' people
The command reads each line. Once gawk reaches the end of the input, it runs the END block and prints the value of NR, which by then contains the total number of lines. Running the same command without the END statement prints the value of NR for every line as it is read, that is, a running line number from 1 up to the line count.
Filter Lines Based on Length
Use the following command to print only lines longer than 20 characters:
gawk 'length>20' people
It also works with multiple conditions. For instance, show lines longer than 17 but shorter than 20 characters:
gawk 'length<20 && length>17' people
To display lines that are exactly 20 characters long, run:
gawk 'length==20' people
Print Info Based on Conditions
The gawk command allows for the use of if-else statements. For instance, another way to filter only people born after 1999 is with a simple if statement:
gawk '{ if ($4>1999) print }' people
The if statement sets the condition that entries in column four have to be larger than 1999. The output shows only entries that satisfy the condition. Expand the command into an if-else statement to also print lines not satisfying the original condition:
gawk '{if ($4>1999) print $0, "==>00s"; else print $0, "==>90s"}' people
The command includes:
- If statement. If the condition is satisfied, gawk adds the string "==>00s" to the output line.
- Else statement. In case the line doesn't satisfy the condition, gawk still prints that line in the output, adding the "==>90s" string to the output.
Add a Header
In the same way in which the END statement allows users to modify the output at the end of the file, the BEGIN statement formats the data at the beginning.
When used with gawk, the BEGIN sections are always executed first, before any input is read. After that, gawk processes the input lines. One way to use the BEGIN statement is to add a header to the output.
Execute the following command to add a section above the gawk output:
gawk 'BEGIN {print "No/First&Last Name/Year of Birth"} {print $0}' people
Find the Longest Line Length
Combine previous arguments with the if and END statements to find the length of the longest line in the people file:
gawk '{ if (length($0) > max) max = length($0) } END { print max }' people
Find the Number of Fields
The gawk command also allows users to display the number of fields with the NF variable. The simplest way to display the number of fields prints a difficult-to-read output:
gawk '{print NF}' people
The command outputs the number of fields per line without any additional info. To customize the output and make it more human-readable, adjust the initial command:
gawk '{print NR, "-->", NF}' people
The command now includes:
- The NR variable that adds line numbers to each output line.
- The --> string that separates line numbers from the field numbers.
Another way to show line and field numbers in the people file is to print the whole line together with NF. Note that the people file includes ordinal numbers in column one. Therefore the NR variable is omitted:
gawk '{print $0, "-->", NF}' people
Finally, to print the total number of fields, execute:
gawk '{num_fields = num_fields + NF} END {print num_fields}' people
The file has ten lines and four columns, so the command prints 40.
Conclusion
After going through this tutorial, you know how to use the gawk command for advanced text processing and data manipulation.
Also consider using grep, a powerful Linux tool for searching for strings, words, and patterns.