CPS 343/543, 444/544 Lecture notes: Formal languages and grammars
Coverage:
(CPS 343/543): [EOPL2] §1.1 (pp. 19)
(CPS 444/544):
Formal languages
 what is a formal language?
 set of strings (sentences) over some alphabet
 legal strings = sentences
 how do we define a formal language?
 grammar (define syntax of a language)
 any more?
 syntax and semantics
 syntax refers to structure of language
 semantics refers to meaning of language
 previously, both syntax and semantics
 used to be described intuitively
 now, well-defined, formal systems are available
 there are finite and infinite languages
 is C an infinite language?
 most interesting languages are infinite
Progressive stages of sentence validity
 preprocessing (purges comments)
 lexical analysis
 syntax analysis
 semantic analysis
examples:
sentence | lexically valid? | syntactically valid? | semantically valid?
Socrates is a mann. | no | - | -
Man Socrates is a. | yes | no | -
Man is a Socrates. | yes | yes | no
Socrates is a man. | yes | yes | yes
Compilation example
(regenerated with minor modifications from [CGLY] Fig. 1, p. 4)
Execution through interpretation
(adapted version of [EOPL3] Fig. 3.1a, p. 59)
Execution through compilation
(adapted version of [EOPL3] Fig. 3.1b, p. 59)
Regular grammars,
regular languages, and lexical analysis
Finite Automata and Regular Expressions (courtesy Randal Nelson and Tom LeBlanc,
University of Rochester)
 what does lexical analysis do?
 parcel characters into lexemes
 lexemes fall into token categories
 example: parceling int i = 20; into lexemes
lexeme | token
int | reserved word
i | identifier
= | special symbol
20 | literal
; | special symbol
 principle of longest substring [PLPP] p. 79
 what delimiter did we use? whitespace?
 free-format language: formatting has no effect on program structure
[PLPP] p. 79
 fixed-format language: formatting has an effect on program structure;
early versions of FORTRAN were fixed format (e.g.,
DO 99 I = 1.10 (DO99I = 1.10 in C) is different from
DO 99 I = 1, 10 (for (I=1; I<=10; I++) in C))
 others: Haskell and Python use layout-based (indentation)
syntactic grouping
 reserved words (cannot be used as a name (e.g., int
in C)) vs.
keywords (only special in certain contexts (e.g., main in C))
[COPL]
 returns a stream of tokens
(why stream of tokens and not stream of lexemes?)
 lexemes can be formally described by regular expressions (equivalently, by regular grammars)
 . (any single character)
 * (0 or more of the previous character)
 | (or)
 shorthand notation:
 [a-z] (one of the characters in this range)
 [^a-z] (any character except one in this range)
 examples:
 regular grammars (also called linear grammars)
are generative devices for regular languages
 regular grammars define regular languages
 any finite language is regular
 sentences from regular languages are recognized using finite state
automata (FSA)
 example of coding up a lexical analyzer
 recognizing positive integers and identifiers in C: [1-9][0-9]* |
[_a-zA-Z][_a-zA-Z0-9]*
 lexical analyzer is called a scanner or a lexer
 example finite state automaton
 lex
(or flex):
a UNIX tool which takes a set of regular expressions (in a .l file)
and generates a lexical analyzer in C for those;
each call to yylex() retrieves the next token
formal language | defined by / generator | model of computation / recognizer
regular language (RL) | regular grammar (RG) | finite state automaton (FSA)
Context-free grammars (Backus-Naur form) and
context-free languages
Generating
and Recognizing Recursive Descriptions of Patterns with Context-Free
Grammars (courtesy Randal Nelson and Tom LeBlanc,
University of Rochester)
 stream of tokens must conform to a grammar
(must be arranged in a particular order)
 grammars define how sentences are constructed
 defined using a metalanguage notation called Backus-Naur form (BNF)
 John Backus @ IBM for Algol 58 (1977 ACM A.M. Turing Award winner)
 Noam Chomsky
 Peter Naur for Algol 60 (2005 ACM A.M. Turing Award winner)
 simple grammar for English sentences
(r1) <sentence> ::= <article> <noun> <verb> <adverb> .
(r2) <article> ::= a | an | the
(r3) <noun> ::= dog | cat | Socrates
(r4) <verb> ::= runs | jumps
(r5) <adverb> ::= slow | fast
 elements:
 grammar = set of production rules,
 start symbol (<sentence>),
 nonterminals (<noun>),
 terminals (cat)
 another example:
(r1) <expr> ::= <expr> + <expr>
(r2) <expr> ::= <expr> * <expr>
(r3) <expr> ::= <id>
(r4) <id> ::= x | y | z
 EBNF: adds {}*, {}*(c), {}+, [ ], and ( | )
 {}* means 0 or more of the enclosed
 {}+ means 1 or more of the enclosed
 {<expression>}*(c) means 0 or more of the enclosed, separated by c
 [ ] means the enclosed is optional
 ( | ) means alternation
 example:
<expr> ::= ( <list> )
<expr> ::= a
<list> ::= <expr>
<list> ::= <expr> <list>
 EBNF grammar which defines the same language
<expr> ::= ( <list> ) | a
<list> ::= <expr> [ <list> ]
 another example:
<term> ::= <term> + <factor>
<term> ::= <factor>
 EBNF grammar which defines the same language
<term> ::= <factor> {+ <factor>}*
Language generation and recognition
(syntactic analysis or parsing)
 what can we use grammars for?
 language generation
 apply the rules in a top-down fashion
 construct a derivation
 deriving sentences from the above grammar;
derive "the dog runs fast" (=> means `derive')
<sentence> => <article> <noun> <verb> <adverb> . (r1)
=> <article> <noun> <verb> fast . (r5)
=> <article> <noun> runs fast . (r4)
=> <article> dog runs fast . (r3)
=> the dog runs fast . (r2)
 grammar for simple arithmetic expressions for
a simple four-function calculator
(r1) <expr> ::= <expr> + <expr>
(r2) <expr> ::= <expr> - <expr>
(r3) <expr> ::= <expr> * <expr>
(r4) <expr> ::= <expr> / <expr>
(r5) <expr> ::= <id>
(r6) <id> ::= x | y | z
(r7) <expr> ::= (<expr>)
(r8) <expr> ::= <number>
(r9) <number> ::= <number> <digit>
(r10) <number> ::= <digit>
(r11) <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
 there are leftmost and rightmost derivations
 some derivations are neither
 sample derivations of 234:
 leftmost derivation:
<expr> => <number> (r8)
=> <number> <digit> (r9)
=> <number> <digit> <digit> (r9)
=> <digit> <digit> <digit> (r10)
=> 2 <digit> <digit> (r11)
=> 23 <digit> (r11)
=> 234 (r11)
 rightmost derivation:
<expr> => <number> (r8)
=> <number> <digit> (r9)
=> <number> 4 (r11)
=> <number> <digit> 4 (r9)
=> <number> 34 (r11)
=> <digit> 34 (r10)
=> 234 (r11)
 neither rightmost nor leftmost derivation:
<expr> => <number> (r8)
=> <number> <digit> (r9)
=> <number> <digit> <digit> (r9)
=> <number> <digit> 4 (r11)
=> <number> 34 (r11)
=> <digit> 34 (r10)
=> 234 (r11)
 neither rightmost nor leftmost derivation:
<expr> => <number> (r8)
=> <number> <digit> (r9)
=> <number> <digit> <digit> (r9)
=> <number> 3 <digit> (r11)
=> <digit> 3 <digit> (r10)
=> 23 <digit> (r11)
=> 234 (r11)
 derive "x + y * z"
<expr> => <expr> + <expr> (r1)
=> <expr> + <expr> * <expr> (r3)
=> <expr> + <expr> * <id> (r5)
=> <expr> + <expr> * z (r6)
=> <expr> + <id> * z (r5)
=> <expr> + y * z (r6)
=> <id> + y * z (r5)
=> x + y * z (r6)
 is a grammar a generative device or recognition device?
one of the seminal discoveries in computer science
 language recognition: do the reverse
generation: grammar (start symbol) → sentence
recognition: sentence → grammar (start symbol)
 let's parse x + y * z (do the reverse)
. x + y * z | (shift)
x . + y * z | (reduce r6)
<id> . + y * z | (reduce r5)
<expr> . + y * z | (shift)
<expr> + . y * z | (shift)
<expr> + y . * z | (reduce r6)
<expr> + <id> . * z | (reduce r5)
<expr> + <expr> . * z | (shift) ← why not reduce r1 here instead?
<expr> + <expr> * . z | (shift)
<expr> + <expr> * z . | (reduce r6)
<expr> + <expr> * <id> . | (reduce r5)
<expr> + <expr> * <expr> . | (reduce r3; emit multiplication)
<expr> + <expr> . | (reduce r1; emit addition)
<expr> . | (start symbol...hurray! this is a valid sentence)
 . (dot) denotes the top of the stack (the stack is to its left; the remaining input to its right)
 the portion of the stack matching a rule's rhs is called the handle
 called bottom-up or shift-reduce parsing
 construct a parse tree
 the above parse exhibits a shift-reduce conflict
 if we shift, multiplication will have higher precedence (desired)
 if we reduce, addition will have higher precedence (undesired)
 there is also a reduce-reduce conflict (though the above
parse does not have one); consider the following:
(r1) <expr> ::= <term>
(r2) <expr> ::= <id>
(r3) <term> ::= <id>
(r4) <id> ::= x | y | z
let's parse x
. x | (shift)
x . | (reduce r4)
<id> . | ← reduce r2 or r3 here?
parse trees for x
 the underlying source of a shift-reduce conflict and
a reduce-reduce conflict is an ambiguous grammar
formal language | defined by / generator | model of computation / recognizer
regular language (RL) | regular grammar (RG) | finite state automaton (FSA)
context-free language (CFL) | context-free grammar (CFG) | pushdown automaton (PDA)
Ambiguity
 disambiguation is a mechanical process: take a compilers course
 C still uses an ambiguous grammar; why?
 disambiguated rules get lengthy and impractical to implement
Syntax analysis
Context-sensitivity
 an example of a property that is not context-free, or
what is an example of something that is context-sensitive?
 in this course we will not go beyond CFGs
 is C a context-free or context-sensitive language (CSL)?
it is a CSL implemented with a CFG
 solutions:
 use more powerful grammars (CSGs), or
 use attribute grammars (courtesy of Knuth, the 1974 ACM A.M.
Turing Award winner): CFGs decorated with rules
(see [COPL] pp. 130-136)
Chomsky hierarchy
(progressive classes of formal grammars)
 phrase-structure (unrestricted) grammars
 generate recursively enumerable (unrestricted) languages
 include all formal grammars
 implemented with Turing machines
 context-sensitive grammars
 generate context-sensitive languages
 implemented with linear-bounded automata
 context-free grammars
 generate context-free languages
 single nonterminal on left
 nonterminals & terminals on right
 implemented with pushdown automata
 regular grammars
 generate regular languages
 implemented with finite state automata
formal language | defined by / generator | model of computation / recognizer
regular language (RL) | regular grammar (RG) | finite state automaton (FSA)
context-free language (CFL) | context-free grammar (CFG) | pushdown automaton (PDA)
context-sensitive language (CSL) | context-sensitive grammar (CSG) | linear-bounded automaton (LBA)
recursively enumerable language (REL) | unrestricted grammar (UG) | Turing machine (TM)
Exercises
 express .*hw.* as a CFG
 express <S> ::= () | <S><S> | (<S>) as a regular
grammar
 generates strings of balanced parentheses (no dangling parentheses)
 of critical importance to programming languages
 e.g., (()), ()()
 a CSG can express context (which a CFG cannot).
what can a CFG express that a regular grammar cannot?
(hint: exercise above gives some clues)
Constructs and capabilities
 regular grammars can capture rules for a valid identifier in C
 contextfree grammars can capture rules
for a valid mathematical expression in C
 neither can capture the
fact that a variable must be declared before it is used
 can push semantic properties like
precedence and associativity into the grammar
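A standard way to push these properties into the grammar (a textbook refactoring, not a grammar given in these notes) is to stratify the expression grammar so each precedence level gets its own nonterminal, with left recursion yielding left associativity:

<expr>   ::= <expr> + <term> | <expr> - <term> | <term>
<term>   ::= <term> * <factor> | <term> / <factor> | <factor>
<factor> ::= ( <expr> ) | <id>

Here * and / are generated lower in the parse tree than + and -, so they bind more tightly, and the grammar is unambiguous.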
 static semantics
 declaring a variable before its usage
 type compatibility
 can use attribute grammars
 dynamic semantics (in order from least to most formal)
 operational
 denotational
 axiomatic
References
[CGLY] 
T. Niemann. A
Compact Guide to Lex and Yacc. ePaperPress.

[COPL] 
R.W. Sebesta. Concepts of Programming Languages.
Addison-Wesley, Boston, MA, Sixth edition, 2003. 
[EOPL2] 
D.P. Friedman, M. Wand, and C.T. Haynes.
Essentials of Programming Languages.
MIT Press, Cambridge, MA, Second edition, 2001. 
[EOPL3] 
D.P. Friedman and M. Wand.
Essentials of Programming Languages.
MIT Press, Cambridge, MA, Third edition, 2008. 
[IALC] 
J.E. Hopcroft, R. Motwani, and J.D. Ullman.
Introduction to Automata Theory, Languages, and Computation.
Addison-Wesley, Boston, MA, Third edition, 2006.

[PLPP] 
K.C. Louden.
Programming Languages: Principles and Practice.
Brooks/Cole, Pacific Grove, CA, Second edition, 2002.

[UPE] 
B.W. Kernighan and B. Pike.
The UNIX Programming Environment.
Prentice Hall, Upper Saddle River, NJ, 1984. 
