CPS 444/544 Lecture notes: Regular expressions



Coverage: [UPE] Chapter 4, §4.1 (pp. 101-105)


Regular expressions

  • a regular expression (RE) defines a set of one or more strings of characters and is said to match any string in that set (e.g., /abc/ is an RE which matches abc)
  • the strings matched by a regular expression can be recognized with a finite state automaton (FSA)
    • an FSA has limited recognition capabilities (e.g., no memory beyond its current state) and, therefore, cannot match arbitrarily nested parentheses
    • simple FSA for recognizing a legal C identifier
  • an RE is built using a combination of literal characters and metacharacters
  • a character is any character except a newline: a-z A-Z 0-9 () = ; : ,
  • a metacharacter (or special character) is a character which represents something other than itself:
      . * [] ^ - $ / + ? | ( ) \{ \}
  • a delimiter is a special character marking the start or end of a regular expression; we use / here
  • see regexp(5) manpage
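
The FSA idea above can be sketched in code. What follows is a minimal, hand-written two-state recognizer for a legal C identifier, written as a POSIX shell function. The function name is_c_identifier and the state names start and ident are hypothetical, chosen only for illustration. It reads one character at a time and remembers nothing except its current state.

```shell
# Hypothetical helper: a two-state FSA that accepts legal C identifiers.
# States: start (nothing read yet), ident (inside a valid identifier).
is_c_identifier() {
    s=$1
    state=start
    while [ -n "$s" ]; do
        rest=${s#?}        # everything after the first character
        c=${s%"$rest"}     # the first character itself
        s=$rest
        case $state:$c in
            start:[A-Za-z_])    state=ident ;;  # first char: letter or _
            ident:[A-Za-z0-9_]) state=ident ;;  # later chars: also digits
            *)                  return 1 ;;     # no transition: reject
        esac
    done
    [ "$state" = ident ]   # accept only if we ended in the ident state
}

is_c_identifier "_count1" && echo "_count1: accepted"
is_c_identifier "1count"  || echo "1count: rejected"
```

The empty string is rejected because the machine never leaves the start state.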


Who /uses/ [Rr]eg.lar [Ee]xpre[s*]ions\?

Regular expressions are used by many UNIX utilities:

  • the shell
  • ex (UNIX line editor; interactive)
  • vi (UNIX visual editor; interactive)
  • emacs (general-purpose editor)
  • tr (character translation tool)
  • grep (global regular expression print; file searching tool/utility; returns entire matched line, not just matched string)
  • sed (UNIX stream editor; non-interactive)
  • awk (pattern scanning and processing language)
  • perl (Practical Extraction and Report Language; based on the UNIX shell and sed and awk)
  • python (Python scripting language)


Using grep

  • print lines matching a pattern
  • examples:
      $ grep "abc" filename

      prints all lines in the given file that contain abc anywhere in them

      $ grep -i "abc" filename

      same as above, but ignores the case of the search string

      $ grep -v "abc" filename

      prints all lines in the given file that do not contain abc anywhere in them

      $ grep -i path .login .tcshrc

      searches both files for path, ignoring case; with more than one file, each matching line is prefixed with its filename

      $ grep -f searchstrings .login .tcshrc

      reads the search patterns, one per line, from the file named after -f (here, searchstrings)

  • quotes are optional around regular expressions which do not contain spaces or other shell metacharacters
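
To try the options above without an existing file, the following sketch pipes a few made-up lines into grep. The function name sample and its three input lines are assumptions used only for illustration; grep reads standard input when no filename is given.

```shell
# Hypothetical sample input, standing in for "filename" above.
sample() {
    printf '%s\n' 'abc def' 'ABC DEF' 'xyz'
}

plain=$(sample | grep "abc")        # lines containing abc
nocase=$(sample | grep -i "abc")    # same, but ignoring case
inverted=$(sample | grep -v "abc")  # lines NOT containing abc

printf 'grep:\n%s\n\n' "$plain"
printf 'grep -i:\n%s\n\n' "$nocase"
printf 'grep -v:\n%s\n' "$inverted"
```

Note that grep prints the entire matching line (abc def), not just the matched string (abc).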


Special characters

  • period: .
    • matches any single character

    • /a.c/ matches abc  adc  aec  a=c  a:c
      /x..x/ matches xaax  xavx  x=kx

  • asterisk: *
    • matches zero or more occurrences of the previous RE
    • notice that this is different from the shell wildcard meaning, where * by itself matches any string
    • /ab*c/ matches ac  abc  abbc  abbbbbbbbbbbbbbbbc
      /a*/ matches ""  a  aa  aaaaaaaaaa
      /a*b*c*/ matches ?
      /.*/ matches ?

  • square brackets, the character class symbol: []
    • indicates a set of characters, any one of which can match
    • * and $ lose their special meaning
    • ^ at the start means NOT
    • - between characters refers to a range

    • /[Mm]ark/ matches mark  Mark
      /t[aeiou]x/ matches tax  tex  tix  tox  tux
      /[abc].*/ matches anything beginning with a or b or c
      /[a-z][a-z]/ matches any two-letter lower-case string
      /[a-zA-Z]*/ matches any word made of letters (including the empty word)
      /[^abc].*/ matches anything starting with something besides a or b or c
      /[a-zA-Z0-9_]*/ matches ?

    • to match a literal ^ in a character class, put it somewhere other than in the first position (e.g., [a-z^])
    • to match a literal - in a character class, put it somewhere other than in between two characters (e.g., [-a-z])
    • all other metacharacters are literal in a character class
    • therefore, context matters

  • caret: ^; outside a character class means `beginning of line'

    /^T/ matches all lines starting with T
    /^[0-9]/ matches ?

  • dollar sign: $; outside of a character class means `end of line'

    /T$/ matches all lines ending with T
    /^$/ matches ?

  • backslash: \; used to escape special characters

    /\./ matches .
    /a\*b/ matches a*b
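
The special characters above can be checked against small word lists with grep. This is a sketch only: the helper name whole_matches and the word lists are assumptions made for illustration. The helper uses grep -x, which requires the RE to match the entire line, so each result shows exactly which words the RE defines.

```shell
# Hypothetical helper: the first argument is an RE, the remaining arguments
# are candidate strings; prints the candidates that the RE matches in full.
whole_matches() {
    pattern=$1
    shift
    printf '%s\n' "$@" | grep -x -e "$pattern" | paste -s -d ' ' -
}

dot=$(whole_matches 'a.c' abc a=c ac abbc)         # . is exactly one char
star=$(whole_matches 'ab*c' ac abc abbc adc)       # zero or more b's
class=$(whole_matches '[Mm]ark' mark Mark bark)    # one char from the set
negated=$(whole_matches '[^abc].*' apple dog cat)  # first char not a, b, c
escaped=$(whole_matches 'a\*b' 'a*b' aab b)        # \* is a literal *
unescaped=$(whole_matches 'a*b' 'a*b' aab b)       # * is the metacharacter

# Anchors: no -x here, since ^ and $ tie the match to a line boundary.
starts=$(printf '%s\n' Tea hoT iTem | grep '^T' | paste -s -d ' ' -)
ends=$(printf '%s\n' Tea hoT iTem | grep 'T$' | paste -s -d ' ' -)

printf '%s\n' "$dot" "$star" "$class" "$negated" \
    "$escaped" "$unescaped" "$starts" "$ends"
```

Comparing the escaped and unescaped results shows why the backslash matters: with it only the literal string a*b matches, without it the strings aab and b match instead.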


RE examples

  • social security numbers: [0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9] (yes, it is rather long-winded, but we can shorten it; see below)
  • legal C identifier: [a-zA-Z_][a-zA-Z0-9_]*
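
A sketch of how the two REs above might be used to validate input with grep. The function names is_ssn and is_identifier and the test strings are hypothetical. The ^ and $ anchors are an addition not present in the REs above; they are assumed here so that the whole string, not just part of it, must match.

```shell
# Hypothetical helpers; ^ and $ force the RE to cover the whole string.
is_ssn() {
    printf '%s\n' "$1" |
        grep '^[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]$' > /dev/null
}

is_identifier() {
    printf '%s\n' "$1" | grep '^[a-zA-Z_][a-zA-Z0-9_]*$' > /dev/null
}

for s in 123-45-6789 123-456-789 12345; do
    if is_ssn "$s"; then
        echo "$s: has the form of an SSN"
    else
        echo "$s: no match"
    fi
done

for s in total_1 _tmp 2fast; do
    if is_identifier "$s"; then
        echo "$s: legal C identifier"
    else
        echo "$s: no match"
    fi
done
```

Without the anchors, a line such as x123-45-6789y would also match, because grep looks for the pattern anywhere in the line.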


Regular expression rule

  • REs always match the longest string possible, starting at the leftmost position in the line where a match can begin
  • example:
    This (rug) is not what it once was (a long time ago), is it?
    
    /Th.*is/ matches ?
    /(.*)/ matches ?
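
One way to check your answers to the two questions above is to let sed mark the match. This is a sketch, not part of the original exercise: it relies on sed's s command, where & in the replacement stands for the text the RE matched, so the square brackets in the output show where the match begins and ends. sed uses the same basic REs as grep, so ( and ) are literal here.

```shell
line='This (rug) is not what it once was (a long time ago), is it?'

# Wrap whatever each RE matches in [ and ] so the extent is visible.
first=$(printf '%s\n' "$line" | sed 's/Th.*is/[&]/')
second=$(printf '%s\n' "$line" | sed 's/(.*)/[&]/')

printf '%s\n%s\n' "$first" "$second"
```

In both cases .* extends as far to the right as it can while still allowing the rest of the RE to match.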


Full regular expressions

(used in egrep)


  • as opposed to basic (also called simple or limited) regular expressions used in grep
  • egrep is extended grep

  • plus: +; like *, but matches one or more occurrences of the preceding RE

    /ab+c/ matches abc  abbc  abbbc but not ac
    /..*/ is equivalent to /.+/

  • question mark: ?; matches zero or one occurrence of the previous RE

    /ab?c/ matches ac  abc

  • logical or: |; matches either the RE before or the RE after the vertical bar

    /abc|def/ matches abc  def

  • parentheses ( ); can be used to group REs for use with *, ?, +, |, and so on

    /ab(c|d)ef/ matches abcef  abdef
    /((abcef)|(abdef))/ matches abcef  abdef
    /ab(cd|de)fg/ matches abcdfg  abdefg

    depending on the program (see below), you may need to use \( and \) instead

  • braces: \{ \}; used to specify the number of repetitions of the preceding RE

    /[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}/ matches social security numbers
    /a\{4,\}/ matches 4 or more a's (n or more)
    /[a-z]\{3,5\}/ matches 3 to 5 lower case letters (n thru m, with n <= m; range)

  • fgrep: self-study
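
The full-RE operators above can be tried as follows. This is a sketch under a few assumptions: it uses grep -E, which is equivalent to egrep; the function names words and on_one_line and the word list are hypothetical; and -x is added so that each RE must match the whole word. The last line uses plain grep because \{ \} is the basic-RE spelling of braces.

```shell
# Hypothetical word list to match against.
words() {
    printf '%s\n' ac abc abbc abbbc abcef abdef def
}

# Hypothetical helper: join the matching lines with single spaces.
on_one_line() {
    paste -s -d ' ' -
}

plus=$(words | grep -E -x 'ab+c' | on_one_line)        # one or more b's
opt=$(words | grep -E -x 'ab?c' | on_one_line)         # zero or one b
alt=$(words | grep -E -x 'abc|def' | on_one_line)      # either alternative
group=$(words | grep -E -x 'ab(c|d)ef' | on_one_line)  # grouped alternation
braces=$(words | grep -x 'ab\{2,3\}c' | on_one_line)   # two or three b's

printf '%s\n' "$plus" "$opt" "$alt" "$group" "$braces"
```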


Subtle point about REs

  • in grep and ex/vi, ( and ) characters used alone match themselves, while \( and \) are used for grouping
  • egrep uses the opposite conventions
  • \{ and \} are special in grep and ex/vi
  • see [UIAN] Chapter 6 (pp. 295-301) and, especially, Tables 6-1 and 6-2 (pp. 296-297)
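
The opposite conventions for parentheses can be seen by running the same patterns through grep and through grep -E (equivalent to egrep). This is a small sketch; the function name lines and its two input lines are assumptions chosen for illustration.

```shell
# Hypothetical input: one line with literal parentheses, one without.
lines() {
    printf '%s\n' 'f(x)' 'fx'
}

# grep: bare ( ) are literal, \( \) group.
grep_literal=$(lines | grep '(x)' | paste -s -d ' ' -)
grep_group=$(lines | grep 'f\(x\)' | paste -s -d ' ' -)

# egrep: bare ( ) group, \( \) are literal.
egrep_group=$(lines | grep -E '(x)' | paste -s -d ' ' -)
egrep_literal=$(lines | grep -E 'f\(x\)' | paste -s -d ' ' -)

printf 'grep  (x)     : %s\n' "$grep_literal"
printf 'grep  f\\(x\\) : %s\n' "$grep_group"
printf 'egrep (x)     : %s\n' "$egrep_group"
printf 'egrep f\\(x\\) : %s\n' "$egrep_literal"
```

With grouping, the parentheses themselves consume no input, so the RE reduces to fx (or just x) and the line containing literal parentheses only matches when it happens to contain that text.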


References

    [UIAN] A. Robbins. UNIX in a Nutshell. O'Reilly, Beijing, Third edition, 1999.
    [UPE] B.W. Kernighan and R. Pike. The UNIX Programming Environment. Prentice Hall, Upper Saddle River, NJ, Second edition, 1984.
