CPS 444/544 Lecture notes: Regular expressions
Coverage: [UPE] Chapter 4, §4.1 (pp. 101-105)
Regular expressions
- a regular expression (RE)
defines one or more strings of characters;
is said to match any string it defines (e.g.,
/abc/ is an RE which matches abc)
- the strings matched by a regular
expression can be recognized with a
finite state automaton (FSA)
- has limited recognition capabilities (e.g., no memory) and,
therefore, cannot match balanced (arbitrarily nested) parentheses
- simple FSA for recognizing a legal C identifier
- built using a combination of literal and metacharacters
- a character is any character except a newline:
a-z A-Z 0-9 () = ; : ,
- a metacharacter (or special character)
is a character which represents something other than itself:
. * [] ^ - $ / + ? | ( ) \{ \}
- a delimiter is a special character marking
the start or end of a regular expression;
we use / here
- see regexp(5) manpage
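The FSA for a legal C identifier mentioned above can be sketched in a few lines of Python. This is only an illustration, not the figure from the lecture: the state names `start` and `in_id` and the function name `is_c_identifier` are made up here, and only ASCII letters are considered. Note that the only "memory" is the current state.

```python
import string

# Character sets for the two kinds of transitions (ASCII only).
FIRST = set(string.ascii_letters + "_")   # may start an identifier
REST = FIRST | set(string.digits)         # may continue an identifier

def is_c_identifier(text):
    """Run a two-state FSA over text; accept only if it ends in 'in_id'."""
    state = "start"
    for ch in text:
        if state == "start" and ch in FIRST:
            state = "in_id"
        elif state == "in_id" and ch in REST:
            state = "in_id"
        else:
            return False  # no transition defined for this input: reject
    return state == "in_id"

print(is_c_identifier("_tmp1"))  # True
print(is_c_identifier("1abc"))   # False
```

The sketch checks only the form of the name; it does not know about reserved words such as `while`.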
Who /uses/ [Rr]eg.lar [Ee]xpre[s*]ions\?
Regular expressions are used by many
UNIX utilities:
- the shell
- ex (UNIX line editor; interactive)
- vi (UNIX visual editor; interactive)
- emacs (general-purpose editor)
- tr (character translation tool)
- grep (global regular expression print;
file searching tool/utility; returns entire matched line, not just
matched string)
- sed (UNIX stream editor; non-interactive)
- awk (pattern scanning and processing language)
- perl (Practical Extraction and Report Language; based on the
UNIX shell, sed, and awk)
- python (Python scripting language)
Using grep
quotes are optional around regular expressions which do
not contain spaces or other shell metacharacters
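As noted in the list of utilities, grep returns the entire matching line, not just the matched string. A minimal sketch of that behavior, assuming Python's `re` module in place of grep itself (the function name `grep` and the list `sample` are hypothetical; in Python the pattern is an ordinary string, so shell quoting does not arise):

```python
import re

def grep(pattern, lines):
    """Return every line in which pattern matches somewhere (like grep)."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

sample = ["abc def", "xyz", "a-b-c", "zabcz"]
matches = grep("abc", sample)
print(matches)  # ['abc def', 'zabcz']
```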
Special characters
- period: .
- matches any single character
/a.c/ matches
abc adc aec a=c a:c
/x..x/ matches
xaax xavx x=kx
- asterisk: *
- matches zero or more occurrences of the previous RE
- notice that this differs from the shell wildcard meaning, where * by itself matches any string
/ab*c/ matches ac abc abbc abbbbbbbbbbbbbbbbc
/a*/ matches "" a aa aaaaaaaaaa
/a*b*c*/ matches ?
/.*/ matches ?
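The period and asterisk examples can be checked with a short script. This is a sketch using Python's `re` module rather than grep; the helper `matches_whole` is hypothetical and tests whether the RE describes the whole string, which is how the example lists above are written. It also explores the two patterns left as questions.

```python
import re

def matches_whole(pattern, text):
    """True if pattern describes the entire string text."""
    return re.fullmatch(pattern, text) is not None

# . stands for exactly one character
assert matches_whole(r"a.c", "a=c")
assert not matches_whole(r"a.c", "ac")
assert matches_whole(r"x..x", "xavx")

# * repeats the previous RE zero or more times
assert matches_whole(r"ab*c", "ac")
assert matches_whole(r"ab*c", "abbbc")
assert matches_whole(r"a*", "")

# a*b*c*: a's, then b's, then c's, any of which may be absent
for text in ["", "aabccc", "cc", "abc"]:
    assert matches_whole(r"a*b*c*", text)
assert not matches_whole(r"a*b*c*", "ba")

# .* describes any string that contains no newline
assert matches_whole(r".*", "anything at all")
assert matches_whole(r".*", "")
```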
- square brackets, the character class symbol: []
- indicates a set of characters, any one of which can match
- * and $ lose their special meaning
- ^ at the start means NOT
- - between characters refers to a range
/[Mm]ark/ matches mark Mark
/t[aeiou]x/ matches tax tex tix tox tux
/[abc].*/ matches anything beginning with a or b
or c
/[a-z][a-z]/ matches any two-letter lower-case string
/[a-zA-Z]*/ matches any string made only of letters (including the empty string)
/[^abc].*/ matches anything starting with something
besides a or b or c
/[a-zA-Z0-9_]*/ matches ?
- to match a literal ^ in a character class, put it somewhere
other than in the first position (e.g., [a-z^])
- to match a literal - in a character class, put it somewhere
other than in between two characters (e.g., [-a-z])
- all other metacharacters are literal in a character class
- therefore, context matters
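The character class rules can be tried out the same way. Again this is a sketch with Python's `re` module, whose treatment of `^`, `-`, `*`, and `$` inside brackets agrees with the rules listed above; `matches_whole` is a hypothetical helper that tests the whole string.

```python
import re

def matches_whole(pattern, text):
    """True if pattern describes the entire string text."""
    return re.fullmatch(pattern, text) is not None

# any one character from the set
assert matches_whole(r"[Mm]ark", "Mark")
assert matches_whole(r"t[aeiou]x", "tux")
assert not matches_whole(r"t[aeiou]x", "tyx")

# ^ in first position means NOT; elsewhere it is literal
assert matches_whole(r"[^abc].*", "dog")
assert not matches_whole(r"[^abc].*", "cat")
assert matches_whole(r"[a-z^]", "^")

# - between characters is a range; first in the class it is literal
assert matches_whole(r"[a-z][a-z]", "hi")
assert matches_whole(r"[-a-z]", "-")

# * and $ are ordinary characters inside a class
assert matches_whole(r"[*$]", "*")
assert matches_whole(r"[*$]", "$")

# [a-zA-Z0-9_]*: letters, digits, and underscores, possibly none at all
assert matches_whole(r"[a-zA-Z0-9_]*", "max_len2")
assert matches_whole(r"[a-zA-Z0-9_]*", "")
assert not matches_whole(r"[a-zA-Z0-9_]*", "two words")
```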
- caret: ^;
outside a character class means `beginning of line'
/^T/ matches all lines starting with T
/^[0-9]/ matches ?
- dollar sign: $;
outside of a character class means `end of line'
/T$/ matches all lines ending with T
/^$/ matches ?
- backslash: \; used to escape special characters
/\./ matches .
/a\*b/ matches a*b
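The anchors and the backslash can be illustrated by selecting lines from a small list, the way grep selects lines from a file. This is a sketch using Python's `re.search`; the list `lines` and the helper `select` are made up for the example.

```python
import re

lines = ["The end", "", "7 dwarfs", "START", "a*b", "a.b", "aXb"]

def select(pattern):
    """Return the lines in which pattern matches somewhere."""
    return [line for line in lines if re.search(pattern, line)]

starts_with_T = select(r"^T")          # ['The end']
ends_with_T = select(r"T$")            # ['START']
starts_with_digit = select(r"^[0-9]")  # ['7 dwarfs']
empty_lines = select(r"^$")            # ['']
literal_star = select(r"a\*b")         # ['a*b']
literal_dot = select(r"a\.b")          # ['a.b']
unescaped_dot = select(r"a.b")         # ['a*b', 'a.b', 'aXb']
```

Without the backslash, the period in the last pattern is a metacharacter again and all three of the final lines are selected.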
RE examples
- social security numbers:
[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9] (yes, it is
rather long-winded, but we can shorten it; see below)
- legal C identifier: [a-zA-Z_][a-zA-Z0-9_]*
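Both example REs can be used unchanged in Python's `re` module. The sketch below is illustrative only: the sample text and the number 123-45-6789 are made up, and the names `SSN` and `C_IDENT` are not from the lecture. It also shows why anchoring matters when testing whether a whole string is an identifier.

```python
import re

SSN = r"[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]"
C_IDENT = r"[a-zA-Z_][a-zA-Z0-9_]*"

text = "id 123-45-6789 on file, not 12-345-6789"
found = re.findall(SSN, text)
print(found)  # ['123-45-6789']

# An unanchored search finds an identifier inside an illegal name ...
inside = re.search(C_IDENT, "2fast").group()
print(inside)  # fast

# ... so the whole string must be tested to reject it.
whole_ok = re.fullmatch(C_IDENT, "_count1") is not None
whole_bad = re.fullmatch(C_IDENT, "2fast") is not None
print(whole_ok, whole_bad)  # True False
```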
Regular expression rule
Full regular expressions
(used in egrep)
- as opposed to basic (also called
simple or limited) regular expressions
used in grep
- egrep is extended grep
- plus: +;
like *, but matches one or more occurrences of
the preceding RE
/ab+c/ matches abc abbc abbbc
but not ac
..* = .+
- question mark: ?;
matches zero or one occurrence of the previous RE
/ab?c/ matches ac abc
- logical or: |;
matches either the RE before or the RE after the vertical bar
/abc|def/ matches abc def
- parentheses ( );
can be used to group REs for use with *, ?, +,
|, and so on
/ab(c|d)ef/ matches abcef abdef
/((abcef)|(abdef))/ matches abcef abdef
/ab(cd|de)fg/ matches abcdfg abdefg
depending on the program (see below), you may need to use
\( and \) instead
- set braces \{ \}; used to specify repetitions of a RE
/[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}/ matches SSNs
/a\{4,\}/ matches 4 or more a's (n or more)
/[a-z]\{3,5\}/ matches 3 to 5 lower case letters
(n through m, with n <= m; range)
- fgrep: self-study
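The full-RE operators above can be exercised with Python's `re` module, whose syntax follows the egrep conventions for +, ?, |, and ( ). One assumption to keep in mind: Python writes the repetition braces without backslashes, so `\{3\}` in the notes becomes `{3}` here. The helper `matches_whole` is hypothetical and tests the whole string.

```python
import re

def matches_whole(pattern, text):
    """True if pattern describes the entire string text."""
    return re.fullmatch(pattern, text) is not None

# + is one or more, ? is zero or one
assert matches_whole(r"ab+c", "abbc") and not matches_whole(r"ab+c", "ac")
assert matches_whole(r"ab?c", "ac") and matches_whole(r"ab?c", "abc")
assert not matches_whole(r"ab?c", "abbc")

# | chooses between REs; parentheses limit its scope
assert matches_whole(r"abc|def", "def")
assert matches_whole(r"ab(c|d)ef", "abdef")
assert matches_whole(r"ab(cd|de)fg", "abdefg")

# repetition counts (written {n}, {n,}, {n,m} in Python)
assert matches_whole(r"[0-9]{3}-[0-9]{2}-[0-9]{4}", "123-45-6789")
assert matches_whole(r"a{4,}", "aaaaa") and not matches_whole(r"a{4,}", "aaa")
assert matches_whole(r"[a-z]{3,5}", "abcd")
assert not matches_whole(r"[a-z]{3,5}", "abcdef")

# ..* and .+ accept and reject the same strings
for text in ["", "x", "xyz"]:
    assert matches_whole(r"..*", text) == matches_whole(r".+", text)
```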
Subtle point about REs
- in grep and ex/vi, ( and )
characters used alone match themselves,
while \( and \) are used for grouping
- egrep uses the opposite conventions
- \{ and \} are special in grep and
ex/vi
- see [UIAN] Chapter 6 (pp. 295-301) and, especially, Tables 6-1 and
6-2 (pp. 296-297)
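The opposite conventions can be made concrete with a small translator that swaps the escaping of ( ) { } in a grep-style pattern so that it can be used with egrep-style syntax (here, Python's `re` module). This is a rough sketch only: the function name `basic_to_extended` is made up, it does not treat these characters specially inside a character class, and it leaves +, ?, and | untouched.

```python
import re

GROUPING = "(){}"

def basic_to_extended(pattern):
    """Swap the escaping of ( ) { } in a grep-style (basic) pattern."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # backslash + grouping character is special in basic REs:
            # drop the backslash; keep any other escape as it is
            out.append(nxt if nxt in GROUPING else ch + nxt)
            i += 2
        elif ch in GROUPING:
            # a bare grouping character is literal in basic REs:
            # add a backslash so it stays literal
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

grouped = basic_to_extended(r"\(ab\)*c")   # grouping in grep style
literal = basic_to_extended("f(x)")        # literal parentheses in grep style
print(grouped)  # (ab)*c
print(literal)  # f\(x\)
```

With the translated patterns, `re.fullmatch(grouped, "ababc")` succeeds because the parentheses group, and `re.fullmatch(literal, "f(x)")` succeeds because they are literal.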
References
[UIAN] A. Robbins. UNIX in a Nutshell. O'Reilly, Beijing, Third edition, 1999.
[UPE] B.W. Kernighan and R. Pike. The UNIX Programming Environment. Prentice Hall, Upper Saddle River, NJ, Second edition, 1984.