CPS 444/544 Lecture notes: awk



Coverage: [UPE] Chapter 4, §4.4 (pp. 114-131)


Introduction

  • a more powerful sed
  • named after inventors: Aho, Weinberger, and Kernighan
  • follows sed's pattern-action style, but uses C-like syntax to specify commands
  • powerful for table manipulation and data summarization
  • helpful for processing columns (i.e., extracting, manipulating, or printing columns from input streams using specified delimiters)
  • mini relational database management system
  • awk (like sed) is Turing complete


Execution model

BEGIN {commands executed once before any input is read}
{main input loop executed for each line of input}
END {commands executed once after all input is read}


Simple awking

Consider the following input stream (student.grades):
    Lucy    45      55      60       90
    Linus   70      75      88      100
    Larry   75      80      85      100
    Lucia   80      70      70       95
    
  • the following awk script just cats a file; run it as you would run a sed script: awk -f <awk script name> <input file>:
      { print }
      
    Note that the curly braces contain commands, just as in sed. Since there is nothing before {, these commands are applied to all lines. The only difference is that instead of p in sed, we have print.

  • awk has two special patterns, BEGIN and END, where you can put commands which are executed before any line is read, and after all lines are read, respectively. For example:
      BEGIN {
         print "I am going to start reading a file. Whoopie!"
      } 
      { print }
      END {
         print "I have finished reading the file. Sigh."
      }
      
  • when awk reads a line, it automatically splits the line into fields and stores them in variables such as $1 (first field), $2 (second field), and so on. By default, fields are separated by whitespace (any run of spaces or tabs). Therefore, the awk script
      { print $1 }
      
    will just print the names. $0 stores the entire line.

  • we can also create and manipulate variables, much as we would in a C program, except that awk needs no declarations: a variable comes into existence on first use, initialized to 0 (or the empty string). The following demonstrates how to calculate the average of the scores in the first column of numbers (which is actually the second column of the file).
      BEGIN {
         total = 0
         lc = 0
      } {
         total = total + $2
         ++lc
      } END {
         avg = total/lc
         print total, avg
      }
      
  • awk also has system variables to modify the output format (e.g., OFS stands for output field separator); we can set it in the BEGIN part by:
      BEGIN {
         total = 0
         lc = 0
         OFS = "---"
      }
      
    This affects all subsequent output written with the print command: wherever the arguments to print are separated by commas, awk inserts the output field separator between them. Similarly, the FS variable holds the input field separator and can be set to something other than the default whitespace.

  • it is good practice to put one awk command on each line. If you put multiple commands on one line, you need a ; to separate them.
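
The points above can be tried together in one short run. This is only a sketch: it recreates the sample student.grades data in a temporary file (the mktemp path is an assumption, any readable file works), and the lc > 0 guard in END is an addition that avoids dividing by zero on empty input.

```shell
# Recreate the sample input from these notes in a temporary file.
grades=$(mktemp)
cat > "$grades" <<'EOF'
Lucy    45      55      60       90
Linus   70      75      88      100
Larry   75      80      85      100
Lucia   80      70      70       95
EOF

# One command per line; OFS is inserted wherever print sees a comma.
out=$(awk '
BEGIN { OFS = "---" }
{
   total = total + $2
   ++lc
   print $1, $2
}
END { if (lc > 0) print total, total / lc }
' "$grades")

# The same program on one line; the commands now need ; between them.
same=$(awk 'BEGIN { OFS = "---" } { total = total + $2; ++lc; print $1, $2 } END { if (lc > 0) print total, total / lc }' "$grades")

echo "$out"
rm -f "$grades"
```

Both forms print each name and first score joined by ---, followed by a final line with the total (270) and the average (67.5).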


Fine tuning awk

  • the character following -F on the command line specifies the input field delimiter (whitespace by default)
    awk -F: '{print $0}' faculty.details
    awk -F: '{print $1" "$2}' faculty.details
    
  • FS variable: the input field separator (whitespace by default), can be assigned a value
  • OFS variable: the output field separator (a single space by default), can be assigned a value
  • NF variable: stores the number of fields in the current record
  • NR variable: the total number of input records seen so far
  • can use C-style printf for formatted output (e.g., printf ("%d\n", $1);); unlike print, printf does not append a newline or insert OFS
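
A small sketch tying together -F, FS, OFS, NF, NR, and printf. The colon-separated records are made up for illustration; they are not the contents of faculty.details.

```shell
# -F: sets the input field separator; NR and NF are maintained by awk.
out=$(printf 'smith:jane:math\njones:bob\n' |
   awk -F: '{ printf("record %d: %s has %d fields\n", NR, $1, NF) }')
echo "$out"

# The same separator set through FS, with OFS controlling what print
# places between comma-separated arguments.
swapped=$(printf 'smith:jane\n' |
   awk 'BEGIN { FS = ":"; OFS = ", " } { print $2, $1 }')
echo "$swapped"
```

The first command reports 3 fields for record 1 and 2 fields for record 2; the second prints jane, smith.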


Simple example command lines

    who | awk '{print $1}' # to see who is logged in
    
    who | awk '{print $5}' # to see from where users are logged in
    
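    print "$(hostname) has been up for $(uptime | awk '{print $3}') days." # print is a ksh builtin (use echo in other shells); $3 is a day count only when uptime reports days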
    
    awk '{print}' faculty.details # works like cat
    
    awk -F, '{print $2 " " $1}' guestlist
    
    awk -F, '{print $2, " ", $1}' guestlist # why three spaces between fields in output?
    
    awk -F, '{print $2 " " $1}' guestlist | sort # sorts by first name
    
    awk 'BEGIN {FS=":"} {print NF}' faculty.details
    
    awk 'BEGIN {FS=","; OFS=":"} {print $2, $1}' guestlist
    


Gradebook example

    awk 'BEGIN {
       ns = 0
       total = 0
    } {
       sum = $2 + $3 + $4
       avg = sum / 3
       ns++
       total += avg
       printf ("%d %s: %.2f\n", ns, $1, avg)
    } END { printf ("%d students: %.2f\n", ns, total/ns) }' scores
    
    Assuming that the file scores contains
    Peter 85 90 95
    Paul  25 25 50
    Mary 100 80 60
    
    this awk command generates the output
    1 Peter: 90.00
    2 Paul: 33.33
    3 Mary: 80.00
    3 students: 67.78
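
For comparison, here is a sketch of the same computation using print instead of printf. print writes integral values as integers and formats other numbers with OFMT (default "%.6g"), so the averages are not rounded to two decimal places. The input is piped in inline rather than read from scores, and the ns > 0 guard in END is an addition.

```shell
out=$(printf 'Peter 85 90 95\nPaul  25 25 50\nMary 100 80 60\n' |
   awk '{
      avg = ($2 + $3 + $4) / 3
      ns++
      total += avg
      # Adjacent values are concatenated; a comma inserts OFS (a space).
      print ns ": " $1, avg
   } END { if (ns > 0) print ns " students:", total / ns }')
echo "$out"
```

With print, Paul's average appears as 33.3333 and the class average as 67.7778.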
    


More examples

    $ cat ouruniq
    BEGIN {
       prevline = ""
    } {
       if (NR == 1 || $0 != prevline) {
          print $0
          prevline = $0
       }
    } 
    
    $ cat uniq1line
    BEGIN {
       prevline = ""
    } {
       if (NR == 1 || $0 != prevline) {
          printf ("%s ", $0);
          prevline = $0
       }
    } END {
       printf ("\n");
    }
    
    $ sort names | awk -f ouruniq
    $ sort names | awk -f uniq1line
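
To see what the ouruniq pipeline does, here is a sketch that runs the same logic on a few made-up names (the list is hypothetical, not the contents of names). The program is passed inline instead of with -f, and the BEGIN block is omitted because an unset awk variable already starts out as the empty string.

```shell
names='Linus
Lucy
Linus
Larry
Lucy'

# sort brings duplicates together; awk then prints a line only when it
# differs from the one before it.
out=$(printf '%s\n' "$names" | sort | awk '
{
   if (NR == 1 || $0 != prevline) {
      print $0
      prevline = $0
   }
}')
echo "$out"
```

Each name appears once, in sorted order (Larry, Linus, Lucy), which matches what sort -u produces for the same input.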
    


References

    [UPE] B.W. Kernighan and R. Pike. The UNIX Programming Environment. Prentice Hall, Upper Saddle River, NJ, 1984.
