CPS 444/544 Lecture notes: awk
Coverage: [UPE] Chapter 4, §4.4 (pp. 114-131)
Introduction
- a more powerful sed
- named after inventors: Aho, Weinberger, and Kernighan
- follows sed style, but uses C-like syntax to specify commands
- powerful for table manipulation and data summarization
- helpful for processing columns (i.e., extracting, manipulating, or
printing columns from input streams using specified delimiters)
- mini relational database management system
- awk is Turing complete
Execution model
BEGIN {commands executed once before any input is read}
{main input loop executed for each line of input}
END {commands executed once after all input is read}
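The three phases above can be sketched with a tiny run. This is only an illustration: the three input lines are made up, and NR is awk's built-in count of lines read so far.

```shell
# Made-up three-line input; NR is the built-in record (line) counter.
out=$(printf 'alpha\nbeta\ngamma\n' | awk '
BEGIN { print "start" }                    # runs once, before any input is read
      { print NR ": " $0 }                 # runs once for each input line
END   { print "done after " NR " lines" }  # runs once, after the last line
')
printf '%s\n' "$out"
# start
# 1: alpha
# 2: beta
# 3: gamma
# done after 3 lines
```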
Simple awking
Consider the following input stream (student.grades):
Lucy 45 55 60 90
Linus 70 75 88 100
Larry 75 80 85 100
Lucia 80 70 70 95
The following awk script simply prints its input, like cat.
Run it as you would run a sed script:
awk -f <awk script name> student.grades
{ print }
Note that the curly braces contain commands, just as in
sed. Since there is nothing before {,
these commands are applied to all lines.
The only difference is that instead of
p in sed, we have print.
awk has two special patterns, BEGIN
and END, where you can put commands which are
executed before any line is read, and after all lines are
read, respectively. For example:
BEGIN {
print "I am going to start reading a file. Whoopie!"
}
{ print }
END {
print "I have finished reading the file. Sigh."
}
When awk reads a line, it automatically splits
the line into fields and stores them in predefined variables:
$1 (first field), $2 (second field), and so on. By default,
fields are separated by any run of spaces and/or tabs. Therefore, the awk script
{ print $1 }
will just print the names.
$0 stores the entire line.
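A small sketch of field splitting, assuming two lines modeled on student.grades. The second line has been altered on purpose to contain a tab and a double space, to show that any run of blanks counts as one separator. NF is the built-in number of fields on the current line.

```shell
# Second line deliberately uses a tab and two spaces between fields.
out=$(printf 'Lucy 45 55 60 90\nLinus\t70  75 88 100\n' |
      awk '{ print $1, $2, NF }')
printf '%s\n' "$out"
# Lucy 45 5
# Linus 70 5
```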
We can also use and manipulate variables, much as
we would in a C program, except that awk variables need no declaration.
The following script calculates the average of the scores in the
first column of numbers (which is actually the second column of the file).
BEGIN {
total = 0
lc = 0
}
{
total = total + $2
++lc
}
END {
avg = total/lc
print total, avg
}
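A possible variation, sketched on the student.grades data from above: because uninitialized awk variables start out as 0 (or the empty string), the BEGIN block can be dropped, and a guard in END avoids dividing by zero when the input is empty. The shell variable name grades and the "no input" message are only illustrative.

```shell
# Sample data copied from student.grades.
grades='Lucy 45 55 60 90
Linus 70 75 88 100
Larry 75 80 85 100
Lucia 80 70 70 95'

out=$(printf '%s\n' "$grades" | awk '
{ total += $2; ++lc }           # total and lc start as 0 without a BEGIN block
END {
    if (lc > 0)
        print total, total / lc
    else
        print "no input"        # guard against division by zero
}')
printf '%s\n' "$out"            # prints: 270 67.5
```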
awk also has built-in variables that control
the output format (e.g., OFS is the output
field separator). We can set OFS in the BEGIN block:
BEGIN {
total = 0
lc = 0
OFS = "---"
}
This affects all subsequent output written with
the print command: between two expressions separated
by a comma, awk inserts the output field separator.
Similarly, FS is the input field separator variable;
set it to split input on a character other than the default whitespace.
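To make the roles of FS and OFS concrete, here is a minimal sketch on made-up colon-separated input. It also contrasts a comma in print (OFS is inserted) with plain juxtaposition (string concatenation, no OFS).

```shell
# Hypothetical colon-separated records.
out=$(printf 'Lucy:45:55\nLinus:70:75\n' | awk '
BEGIN { FS = ":"; OFS = "---" }
{
    print $1, $2    # comma: OFS is inserted between the two values
    print $1 $2     # no comma: the values are concatenated, OFS is not used
}')
printf '%s\n' "$out"
# Lucy---45
# Lucy45
# Linus---70
# Linus70
```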
It is good practice to put one awk command
on each line. If you put multiple commands on one line, you will need
to use a ; to separate them.
Fine tuning awk
Simple example command lines
who | awk '{print $1}' # to see who is logged in
who | awk '{print $5}' # to see from where users are logged in
echo "$(hostname) has been up for $(uptime | awk '{print $3}') days." # field 3 is the day count only when uptime reports days
awk '{print}' faculty.details # works like cat
awk -F, '{print $2 " " $1}' guestlist
awk -F, '{print $2, " ", $1}' guestlist # why three spaces between fields in output?
awk -F, '{print $2 " " $1}' guestlist | sort # sorts by first name
awk 'BEGIN {FS=":"} {print NF}' faculty.details
awk 'BEGIN {FS=","; OFS=":"} {print $2, $1}' guestlist
Gradebook example
awk 'BEGIN {
ns = 0
total = 0
}
{
sum = $2 + $3 + $4
avg = sum / 3
ns++
total += avg
printf ("%d %s: %.2f\n", ns, $1, avg)
}
END { printf ("%d students: %.2f\n", ns, total/ns) }' scores
Assuming that the file scores contains
Peter 85 90 95
Paul 25 25 50
Mary 100 80 60
this awk command generates the output
1 Peter: 90.00
2 Paul: 33.33
3 Mary: 80.00
3 students: 67.78
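How a number is displayed depends on which output command is used. The sketch below, using Paul's average of 100/3 as the sample value, compares print (which formats non-integer numbers with the built-in OFMT variable, "%.6g" by default) with printf (which uses exactly the format given).

```shell
out=$(awk 'BEGIN {
    avg = 100 / 3
    print avg                # print uses OFMT, "%.6g" by default
    printf("%.2f\n", avg)    # printf uses the format it is given
    printf("%d\n", avg)      # %d drops the fractional part
}')
printf '%s\n' "$out"
# 33.3333
# 33.33
# 33
```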
More examples
$ cat ouruniq
BEGIN {
prevline = ""
} {
if (NR == 1 || $0 != prevline) {
print $0
prevline = $0
}
}
$ cat uniq1line
BEGIN {
prevline = ""
} {
if (NR == 1 || $0 != prevline) {
printf ("%s ", $0);
prevline = $0
}
} END {
printf ("\n");
}
$ sort names | awk -f ouruniq
$ sort names | awk -f uniq1line
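As a further variation on ouruniq (not from the original scripts), the same previous-line comparison can also count repeats, roughly in the spirit of uniq -c. This is a sketch under assumptions: the names are made up, the input must already be sorted, and the variable count is introduced here only for illustration.

```shell
# Hypothetical list of names; LC_ALL=C makes the sort order predictable.
out=$(printf 'Paul\nMary\nPeter\nPaul\nMary\n' | LC_ALL=C sort | awk '
NR > 1 && $0 != prevline {      # a new value starts: report the finished run
    print count, prevline
    count = 0
}
{ prevline = $0; count++ }      # remember the line and extend the current run
END { if (NR > 0) print count, prevline }   # report the final run
')
printf '%s\n' "$out"
# 2 Mary
# 2 Paul
# 1 Peter
```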
References
[UPE] B.W. Kernighan and R. Pike. The UNIX Programming Environment.
Prentice Hall, Upper Saddle River, NJ, Second edition, 1984.