
Awk Introduction and Printing Operations

Awk is a programming language that allows easy manipulation of structured data and the generation of formatted reports. Awk stands for the names of its authors: Aho, Weinberger, and Kernighan. Awk is mostly used for pattern scanning and processing. It searches one or more files to see if they contain lines that match the specified patterns and then performs the associated actions. Some of the key features of Awk are:
- Awk views a text file as records and fields.
- Like common programming languages, Awk has variables, conditionals and loops.
- Awk has arithmetic and string operators.
- Awk can generate formatted reports.

Awk reads from a file or from its standard input, and writes to its standard output. Awk is not suited to non-text (binary) files.
Syntax: awk '/search pattern1/ {Actions} /search pattern2/ {Actions}' file

In the above awk syntax:


- search pattern is a regular expression.
- Actions: statement(s) to be performed.
- Several patterns and actions are possible in Awk.
- file: the input file.
- Single quotes around the program prevent the shell from interpreting any of its special characters.

Awk Working Methodology


1. Awk reads the input files one line at a time.
2. For each line, it tries the given patterns in order; if one matches, it performs the corresponding action.
3. If no pattern matches, no action is performed.
4. In the above syntax, either the search pattern or the action is optional, but not both.
5. If the search pattern is not given, Awk performs the given actions for each line of the input.
6. If the action is not given, the default action is to print every line that matches the given patterns.
7. Empty braces without any action do nothing; they suppress even the default printing.
8. Each statement in Actions should be delimited by a semicolon.
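Points 5 through 7 above can be sketched with a tiny made-up input; the file name and contents here are only for illustration:

```shell
# Two-line sample input, created only for this demonstration.
printf '100 Thomas Manager\n200 Jason Developer\n' > /tmp/awkdemo.txt

# Action without a pattern: runs for every line.
awk '{print $1}' /tmp/awkdemo.txt

# Pattern without an action: the default action prints the matching line whole.
awk '/Jason/' /tmp/awkdemo.txt

# Pattern with empty braces: matches, but performs no action, so prints nothing.
awk '/Jason/ {}' /tmp/awkdemo.txt
```

The first command prints 100 and 200, the second prints the whole Jason line, and the third prints nothing at all.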

Let us create an employee.txt file with the following content, which will be used in the examples below.
$ cat employee.txt
100 Thomas Manager Sales $5,000
200 Jason Developer Technology $5,500
300 Sanjay Sysadmin Technology $7,000
400 Nisha Manager Marketing $9,500
500 Randy DBA Technology $6,000

Awk Example 1. Default behavior of Awk

By default Awk prints every line from the file.


$ awk '{print;}' employee.txt
100 Thomas Manager Sales $5,000
200 Jason Developer Technology $5,500
300 Sanjay Sysadmin Technology $7,000
400 Nisha Manager Marketing $9,500
500 Randy DBA Technology $6,000

In the above example no pattern is given, so the action applies to all the lines. The action print without any argument prints the whole line by default, so every line of the file is printed. Actions have to be enclosed within braces.
Awk Example 2. Print the lines which match the pattern.

$ awk '/Thomas/ > /Nisha/' employee.txt


100 Thomas Manager Sales $5,000
400 Nisha Manager Marketing $9,500

In the above example it prints all the lines which match Thomas or Nisha. It has two patterns. Awk accepts any number of patterns, but each set (a pattern and its corresponding actions) has to be separated by a newline.

Awk Example 3. Print only specific fields.

Awk has a number of built-in variables. For each record, i.e. line, it splits the record, delimited by whitespace characters by default, and stores the fields in the $n variables. If the line has 4 words, they are stored in $1, $2, $3 and $4. $0 represents the whole line. NF is a built-in variable which holds the total number of fields in a record.
$ awk '{print $2,$5;}' employee.txt
Thomas $5,000
Jason $5,500
Sanjay $7,000
Nisha $9,500
Randy $6,000

$ awk '{print $2,$NF;}' employee.txt

Thomas $5,000
Jason $5,500
Sanjay $7,000
Nisha $9,500
Randy $6,000

In the above example $2 and $5 represent Name and Salary respectively. We can also get the Salary using $NF, where $NF represents the last field. In the print statement the comma is a separator: it inserts the output field separator (a space by default) between the items, while juxtaposing items without a comma concatenates them.

Awk Example 4. Initialization and Final Action

Awk has two important patterns specified by the keywords BEGIN and END.

Syntax:
BEGIN { Actions }
{ ACTION }        # Action for every line in the file
END { Actions }
# is for comments in Awk

Actions specified in the BEGIN section are executed before Awk starts reading lines from the input. END actions are performed after it has finished reading and processing all the lines.

$ awk 'BEGIN {print "Name\tDesignation\tDepartment\tSalary";}
> {print $2,"\t",$3,"\t",$4,"\t",$NF;}
> END{print "Report Generated\n--------------";
> }' employee.txt
Name    Designation     Department      Salary
Thomas  Manager         Sales           $5,000
Jason   Developer       Technology      $5,500
Sanjay  Sysadmin        Technology      $7,000
Nisha   Manager         Marketing       $9,500
Randy   DBA             Technology      $6,000
Report Generated
--------------

In the above example, the program prints a header line before and a footer after the report.
Awk Example 5. Find the employees whose employee id is greater than 200

$ awk '$1 >200' employee.txt


300 Sanjay Sysadmin Technology $7,000
400 Nisha Manager Marketing $9,500
500 Randy DBA Technology $6,000

In the above example, the first field ($1) is the employee id. So if $1 is greater than 200, the default action prints the whole line.
Awk Example 6. Print the list of employees in Technology department

The department name is available as the fourth field, so we need to check whether $4 matches the string Technology; if it does, print the line.

$ awk '$4 ~/Technology/' employee.txt
200 Jason Developer Technology $5,500
300 Sanjay Sysadmin Technology $7,000
500 Randy DBA Technology $6,000

The ~ operator matches a value against a regular expression. If it matches, the default action, i.e. printing the whole line, is performed.
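As a minimal sketch, ~ also has a negated form, !~; the two lines of sample data here are made up for illustration:

```shell
# ~ keeps the lines whose fourth field matches the regular expression ...
printf '200 Jason Developer Technology\n400 Nisha Manager Marketing\n' |
awk '$4 ~ /Technology/'     # prints the Jason line

# ... and !~ keeps the lines that do NOT match.
printf '200 Jason Developer Technology\n400 Nisha Manager Marketing\n' |
awk '$4 !~ /Technology/'    # prints the Nisha line
```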
Awk Example 7. Print number of employees in Technology department

The example below checks whether the department is Technology; if it is, the action just increments the count variable, which was initialized to zero in the BEGIN section.

$ awk 'BEGIN { count=0;}
$4 ~ /Technology/ { count++; }
END { print "Number of employees in Technology Dept =",count;}' employee.txt
Number of employees in Technology Dept = 3

Then at the end of processing, printing the value of count gives the number of employees in the Technology department.

Examples with awk: A short introduction

This article gives some insight into the tricks that you can do with AWK. It is not a tutorial but it provides real-life examples to use.

Originally, the idea to write this text came to me after reading a couple of articles published in LinuxFocus that were written by Guido Socher. One of them, about find and related commands, showed me that I was not the only one who used the command line. Pretty GUIs don't tell you how things are really done (that's the way Windows went years ago). The other article was about regular expressions. Although regular expressions are only slightly touched on in this article, you need to know them to get the maximum from awk and other commands like sed and grep. The key question is whether this awk command is really useful. The answer is definitely yes! It can be useful for a normal user to process text files, re-format them, etc. For a system administrator AWK is a really important utility. Just take a look at /var/yp/Makefile or at the initialization scripts. AWK is used everywhere.

Introduction to awk
My first encounter with AWK is old enough to be forgotten. I had a colleague who needed to work with some really big outputs from a small Cray. The manual page for awk on the Cray was small, but he said that AWK looked very much like the thing he needed, although he did not yet understand how to use it. A long time later, AWK came back into my life when a colleague of mine used it to extract the first column from a file with the command:

awk '{print $1}' file

Easy, isn't it? This simple task does not need complex programming in C. One line of AWK does it.

Once we have learned the lesson on how to extract a column we can do things such as renaming files (append .new to "files_list"):

ls files_list | awk '{print "mv "$1" "$1".new"}' | sh

... and more:


1. Renaming within the name:

ls -1 *old* | awk '{print "mv "$1" "$1}' | sed s/old/new/2 | sh

(although in some cases it will fail, as in file_old_and_old)

2. Remove only files:

ls -l * | grep -v drwx | awk '{print "rm "$9}' | sh


or with awk alone:

ls -l|awk '$1!~/^drwx/{print $9}'|xargs rm

Be careful when trying this out in your home directory. We remove files!

3. Remove only directories:

ls -l | grep '^d' | awk '{print "rm -r "$9}' | sh

or

ls -p | grep /$ | awk '{print "rm -r "$1}' | sh

or with awk alone:

ls -l|awk '$1~/^d.*x/{print $9}'|xargs rm -r

Be careful when trying this out in your home directory. We remove things!

4. Killing processes by name (in this example we kill the process called netscape):

kill `ps auxww | grep netscape | egrep -v grep | awk '{print $2}'`

or with awk alone:

ps auxww | awk '$0~/netscape/&&$0!~/awk/{print $2}' |xargs kill

It has to be adjusted to fit the ps command on whatever unix system you are on. Basically it is: "If the process is called netscape and it is not called 'grep netscape' (or awk) then print the pid"

As you can see, AWK really helps when the same calculations are repeated over and over ... and apart from that it is much more fun to write an AWK program than doing almost the same thing 20 times manually.
awk is a little programming language, with a syntax close to C in many aspects. It is an interpreted language and the awk interpreter processes the instructions.

About the syntax of the awk command interpreter itself:


# gawk --help
Usage: gawk [POSIX or GNU style options] -f progfile [--] file ...
       gawk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:            GNU long options:
  -f progfile             --file=progfile
  -F fs                   --field-separator=fs
  -v var=val              --assign=var=val
  -m[fr] val
  -W compat               --compat
  -W copyleft             --copyleft
  -W copyright            --copyright
  -W help                 --help
  -W lint                 --lint
  -W lint-old             --lint-old
  -W posix                --posix
  -W re-interval          --re-interval
  -W source=program-text  --source=program-text
  -W traditional          --traditional
  -W usage                --usage
  -W version              --version

Instead of simply quoting (') the program on the command line, we can, as you can see above, write the instructions into a file and call it with the option -f. With command-line-defined variables using -v var=val we can add some flexibility to the programs.
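A small sketch of both options working together; the program file name and the dept variable below are invented for this example:

```shell
# Store the program in a file and pass a shell value in with -v.
cat > /tmp/deptcount.awk <<'EOF'
$4 ~ dept { count++ }
END { print dept, count+0 }
EOF

printf 'a b c Tech\nd e f Sales\ng h i Tech\n' |
awk -v dept=Tech -f /tmp/deptcount.awk
# -> Tech 2
```

The count+0 in the END block forces a numeric 0 even when no line matched, instead of printing an empty string.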

Awk is, roughly speaking, a language oriented to managing tables: information that can be grouped inside fields and records. The advantage here is that the record definition (and the field definition) is flexible. Awk is powerful. It is designed to work with one-line records, but that point can be relaxed. In order to see some of these aspects in practice, we are going to look at some illustrative (and real) examples.
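Before those, a two-line sketch of that flexibility: FS redefines what a field is, and RS what a record is (an empty RS switches to blank-line-separated "paragraph" records). The inline data is made up:

```shell
# Colon-separated fields:
printf 'one:two:three\n' | awk -F: '{print $2}'
# -> two

# Blank-line-separated records; $1 is then the first word of each block.
printf 'a\nb\n\nc\nd\n' | awk 'BEGIN { RS = "" } { print NR ": " $1 }'
# -> 1: a
# -> 2: c
```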
Printing tables in a slightly prettier way

Maybe you have had to print some ASCII table obtained from somewhere, for example the hostnames, ethernet and IP numbers in a list. When those tables are really big, reading becomes difficult, and we feel that we need this list printed with LaTeX or, at least, with a better format. If the table is simple then it's not too difficult:

BEGIN { printf "LaTeX preamble"
        printf "\\begin{tabular}{|c|c|...|c|}"
      }

      { printf $1" & "
        printf $2" & "
        . . .
        printf $n" \\\\ "
        printf "\\hline"
      }

END   { print "\\end{document}" }

Certainly, this is not a generic program, but we're just starting ...
(The double backslashes are necessary because the backslash is the escape character in awk strings, so "\\" prints a single literal backslash.)

Slicing output files

SIMBAD is an astronomical objects database that, among other things, provides star positions on the sky plane. Once in the past I needed to perform searches to draw charts around some objects. The interface allowed saving the results in text files, and I had two approaches: 1) create one file for each object, or 2) feed it with the whole input list, getting a single big output log file with the query results. As I decided to go for the second approach, I used awk to slice the big output log. Obviously, I needed to take advantage of some characteristics of the output:

1. Each request produces a header line with a format like

   ====> name : nlines <====

   The first header field lets us know when a new object begins, and the fourth how many entries the object contains.
2. The character used in the output lists to mark the different columns was '|'. This requires two additional code lines to filter the output and get only the fields I was interested in.

( $1 == "====>" ) {
    NomObj = $2
    TotObj = $4
    if ( TotObj > 0 ) {
        FS = "|"
        for ( cont=0 ; cont<TotObj ; cont++ ) {
            getline
            print $2 $4 $5 $3 >> NomObj
        }
        FS = " "
    }
}

Actually, the object name was not returned, and it was slightly more complicated, but this is supposed to be an illustrative example.

Playing with the mail spool

Maybe we are administrating a mailing list and from time to time some special messages are submitted to the list (for example, monthly reports) with some specific format (subject as '[MONTH REPORT] month, dept'). Suddenly, we decide at the end of the year to put together all these messages, leaving the others aside. This can be done by processing the mail spool with the awk program below.

BEGIN { BEGIN_MSG = "From"
        BEGIN_BDY = "Precedence:"
        MAIN_KEY  = "Subject:"
        VALIDATION = "[MONTH REPORT]"

        HEAD = "NO"; BODY = "NO"; PRINT = "NO"
        OUT_FILE = "Month_Reports"
      }

{
  if ( $1 == BEGIN_MSG ) {
      HEAD = "YES"; BODY = "NO"; PRINT = "NO"
  }
  if ( $1 == MAIN_KEY ) {
      if ( $2 == VALIDATION ) {
          PRINT = "YES"
          $1 = ""; $2 = ""
          print "\n\n"$0"\n" > OUT_FILE
      }
  }
  if ( $1 == BEGIN_BDY ) {
      getline
      if ( $0 == "" ) { HEAD = "NO"; BODY = "YES" }
      else { HEAD = "NO"; BODY = "NO"; PRINT = "NO" }
  }
  if ( BODY == "YES" && PRINT == "YES" ) {
      print $0 >> OUT_FILE
  }
}

To get each report written to an individual file would mean three extra lines of code.

NOTE: This example assumes that the mail spool is structured as I think it is. This program works for my mail.

I've used awk for many other tasks (automatic generation of web pages with information from simple databases) and I know enough about awk programming to be sure that a lot of things can be done. Just let your imagination fly.

A problem

One problem is that awk needs perfect tabular information: no holes, and no fixed-width columns. This is not problematic if we create the awk input ourselves: choose something uncommon to separate the fields, set FS accordingly, and we are done! If we already have the input, things can be a little more problematic. For example, a table like this:
1234   HD 13324   22:40:54  ....
1235   HD122235   22:43:12  ....

This is difficult to handle with awk, and unfortunately it is quite common. If only one column has this characteristic, we can solve the problem (if anybody knows how to manage more than one such column in a generic way, please let me know!). I had to face one of these tables, similar to the one described above. The second column was a name and it included a variable number of spaces. As it usually happens, I had to sort the table using the last column.

... and a solution

I realized that the column I wanted to sort on was the last one, and awk knows how many fields there are in the current record. Therefore, it was enough to access the last one (sometimes $4, sometimes $5, but always $NF). At the end of the day, the desired result was obtained:
awk '{ printf $NF; $NF = ""; printf " "$0"\n" }' | sort

This just shifts the last column to the first position so that you can sort on it. Obviously, this method is easily adapted to, say, the third field counting from the end, or to the field that comes after a control field which always has the same value. Just use your ideas and imagination.
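As a side note, GNU awk (gawk) can also split fixed-width columns directly with its FIELDWIDTHS variable; the widths below are chosen to fit the sample table above and would need adjusting for real data:

```shell
printf '1234 HD 13324 22:40:54\n1235 HD122235 22:43:12\n' |
gawk 'BEGIN { FIELDWIDTHS = "4 10 8" }   # id, name (incl. padding), time
      { print $1 "|" $3 }'
# -> 1234|22:40:54
# -> 1235|22:43:12
```

This handles any number of variable-content columns, as long as the column positions themselves are fixed.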

Deeper AWK
Working over matched lines

Up to now, nearly all the examples process every line of the input file. But, as the manual page also states, it is possible to process only some of the input lines. One must just precede the group of commands with the condition the line should meet. The matching condition can be very flexible, varying from a simple regular expression to a check on the contents of some field, with the possibility of grouping conditions with the proper logical operators.
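A short sketch of such a mixed condition, on made-up data: a numeric comparison on one field combined with a regular-expression match on another.

```shell
printf '100 Thomas Manager Sales\n300 Sanjay Sysadmin Technology\n' |
awk '$1 > 200 && $4 ~ /Tech/ { print $2 }'
# -> Sanjay
```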

Awk as a programming language

Like any other programming language, awk implements all the necessary flow-control structures, as well as a set of operators and predefined functions to deal with numbers and strings. It is possible, of course, to define user functions with the keyword function. Apart from the common scalar variables, awk is also able to manage variable-sized (associative) arrays.
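A minimal sketch of both features together: a user-defined function and an associative array (the input data is invented).

```shell
printf 'Sales\nTech\nTech\n' |
awk '
function bar(n,  s) {           # crude bar chart; s is a local variable
    while (n-- > 0) s = s "*"
    return s
}
{ count[$1]++ }                 # associative array indexed by a string
END { for (d in count) print d, bar(count[d]) }'
```

Note that the iteration order of for (d in count) is unspecified, so pipe the output through sort if a fixed order matters.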

Including libraries

As it happens in any programming language, there are some very common functions and it becomes uncomfortable to cut and paste pieces of code. That's the reason why libraries exist.

With the GNU version of awk, it is possible to include them within the awk program. This is, however, just an outlook on the things which are possible and outside the scope of this article.
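A sketch of how that looks with gawk's @include directive (gawk 4.0 or later; the file names here are invented):

```shell
# A "library" file with a shared function ...
cat > /tmp/lib.awk <<'EOF'
function double(x) { return 2 * x }
EOF

# ... included by the main program.
cat > /tmp/main.awk <<'EOF'
@include "/tmp/lib.awk"
{ print double($1) }
EOF

printf '21\n' | gawk -f /tmp/main.awk
# -> 42
```

A portable alternative that works with any POSIX awk is to keep the library and the main program in separate files and pass both with -f: awk -f lib.awk -f main.awk (dropping the @include line).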

Conclusions

Certainly, awk might not be as powerful as many other tools designed with similar goals. But it has the big advantage that it is possible to write, in a really short time, small programs which are fully tailored to our needs.

AWK is very appropriate for the purposes for which it was built: read data line by line and act upon the strings and patterns in the lines. Files like /etc/passwd turn out to be ideal for reformatting and processing with AWK. AWK is invaluable for such tasks. Of course AWK is not alone: Perl is a strong competitor, but it is still worthwhile to know some AWK tricks.

Additional information
This kind of very basic command is not very well documented, but you can find something if you look around.

awk syntax is not the same on every Unix system, but there is a way to learn how it behaves on our particular system:
man awk


O'Reilly has published a book: Sed & Awk (Nutshell Handbook) by Dale Dougherty. Looking at Amazon, we find more titles, such as Effective Awk Programming: A User's Guide, oriented to gawk, and half a dozen more.

Usually, all books on Unix mention this command, but only some of them treat it in detail. The best we can do is to browse any book that comes into our hands. You never know where useful information can be found.

4.2 Examples of print Statements


Each print statement makes at least one line of output. However, it isn't limited to only one line. If an item value is a string that contains a newline, the newline is output along with the rest of the string. A single print statement can make any number of lines this way.

The following is an example of printing a string that contains embedded newlines (the \n is an escape sequence, used to represent the newline character; see Escape Sequences):
$ awk 'BEGIN { print "line one\nline two\nline three" }'
-| line one
-| line two
-| line three

The next example, which is run on the inventory-shipped file, prints the first two fields of each input record, with a space between them:
$ awk '{ print $1, $2 }' inventory-shipped
-| Jan 13
-| Feb 15
-| Mar 15
...

A common mistake in using the print statement is to omit the comma between two items. This often has the effect of making the items run together in the output, with no space. The reason for this is that juxtaposing two string expressions in awk means to concatenate them. Here is the same program, without the comma:
$ awk '{ print $1 $2 }' inventory-shipped
-| Jan13
-| Feb15
-| Mar15
...

To someone unfamiliar with the inventory-shipped file, neither example's output makes much sense. A heading line at the beginning would make it clearer. Let's add some headings to our table of months ($1) and green crates shipped ($2). We do this using the BEGIN pattern (see BEGIN/END) so that the headings are only printed once:
awk 'BEGIN { print "Month Crates"
             print "----- ------" }
           { print $1, $2 }' inventory-shipped

When run, the program prints the following:


Month Crates
----- ------
Jan 13
Feb 15
Mar 15
...

The only problem, however, is that the headings and the table data don't line up! We can fix this by printing some spaces between the two fields:
awk 'BEGIN { print "Month Crates"
             print "----- ------" }
           { print $1, "     ", $2 }' inventory-shipped

Lining up columns this way can get pretty complicated when there are many columns to fix. Counting spaces for two or three columns is simple, but any more than this can take up a lot of time. This is why the printf statement was created (see Printf); one of its specialties is lining up columns of data.
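As a taste of what printf buys you, the header and data rows can share one set of format widths (a sketch; see the printf documentation for the full details):

```shell
printf 'Jan 13\nFeb 15\n' |
awk 'BEGIN { printf "%-5s %6s\n", "Month", "Crates" }
           { printf "%-5s %6d\n", $1, $2 }'
# %-5s left-justifies in 5 characters and %6d right-justifies in 6,
# so the columns line up regardless of the values.
```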
NOTE: You can continue either a print or printf statement simply by putting a newline after any comma (see Statements/Lines).
