CPS 343/543, 444/544 Lecture notes: Formal languages and grammars
Coverage:
(CPS 343/543): [EOPL2] §1.1 (pp. 1-9)
(CPS 444/544):
Formal languages
- what is a formal language?
- set of strings (sentences) over some alphabet
- legal strings = sentences
- how do we define a formal language?
- grammar (define syntax of a language)
- any more?
- syntax and semantics
- syntax refers to structure of language
- semantics refers to meaning of language
- previously, both syntax and semantics were described intuitively
- now, well-defined, formal systems are available
- there are finite and infinite languages
- is C an infinite language?
- most interesting languages are infinite
Progressive stages of sentence validity
- preprocessing (purges comments)
- lexical analysis
- syntax analysis
- semantic analysis
examples:
| sentence | lexically valid? | syntactically valid? | semantically valid? |
| Socrates is a mann. | no | - | - |
| Man Socrates is a. | yes | no | - |
| Man is a Socrates. | yes | yes | no |
| Socrates is a man. | yes | yes | yes |
Compilation example
(regenerated with minor modifications from [CGLY] Fig. 1, p. 4)
Execution through interpretation
(adapted version of [EOPL3] Fig. 3.1a, p. 59)
Execution through compilation
(adapted version of [EOPL3] Fig. 3.1b, p. 59)
Regular grammars, regular languages, and lexical analysis
Finite Automata and Regular Expressions (courtesy Randal Nelson and Tom LeBlanc, University of Rochester)
- what does lexical analysis do?
- parcel characters into lexemes
- lexemes fall into token categories
- example: parceling int i = 20; into lexemes
| lexeme | token |
| int | reserved word |
| i | identifier |
| = | special symbol |
| 20 | literal |
| ; | special symbol |
- principle of longest substring [PLPP] p. 79
- what delimiter did we use? whitespace?
- free-format language: formatting has no effect on program structure
[PLPP] p. 79
- fixed-format language: formatting has an effect on program structure;
early versions of FORTRAN were fixed format (e.g.,
DO 99 I = 1.10 (DO99I = 1.10 in C) is different from
DO 99 I = 1, 10 (for (I=1; I<=10; I++) in C))
- others: Haskell and Python use layout-based (indentation)
syntactic grouping
- reserved words (cannot be used as a name (e.g., int
in C)) vs.
keywords (only special in certain contexts (e.g., main in C))
[COPL]
- returns a stream of tokens
(why stream of tokens and not stream of lexemes?)
- lexemes can be formally described by regular expressions (equivalently, regular grammars)
- . (any single character)
- * (0 or more of the previous expression)
- + (or; written | in lex and most other tools)
- shorthand notation:
- [a-z] (one of the characters in this range)
- [^a-z] (any character but one in this range)
- examples:
- regular grammars (also called right-linear or left-linear grammars)
are generative devices for regular languages
- regular grammars define regular languages
- any finite language is regular
- sentences from regular languages are recognized using finite state
automata (FSA)
- example of coding up a lexical analyzer
- distinguishing between positive numbers ([1-9][0-9]*)
and identifiers ([_a-zA-Z][_a-zA-Z0-9]*)
- lexical analyzer is called a scanner or a lexer
- example finite state automaton
- lex
(or flex):
a UNIX tool which takes a set of regular expressions (in a .l file)
and generates a lexical analyzer in C for those;
each call to yylex() retrieves the next token
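The scanner described above can be sketched in Python with the standard `re` module. This is only a minimal sketch and not how lex works internally. The token categories and the two patterns come from these notes; the names `TOKEN_SPEC`, `RESERVED`, `MASTER`, and `tokenize` are hypothetical, and treating `int` as the only reserved word is an assumption made for brevity.

```python
import re

# Token categories paired with regular expressions (hypothetical table).
# The number and identifier patterns are the ones given in the notes.
TOKEN_SPEC = [
    ("literal", r"[1-9][0-9]*"),
    ("identifier", r"[_a-zA-Z][_a-zA-Z0-9]*"),
    ("special symbol", r"[=;]"),
    ("whitespace", r"\s+"),
]
RESERVED = {"int"}  # assumption: a single reserved word for this sketch

# One alternation with a named group (g0, g1, ...) per category.
MASTER = re.compile(
    "|".join(f"(?P<g{i}>{rx})" for i, (_, rx) in enumerate(TOKEN_SPEC))
)

def tokenize(text):
    """Return (lexeme, token) pairs; whitespace only delimits lexemes."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"lexically invalid character {text[pos]!r} at {pos}")
        category = TOKEN_SPEC[int(m.lastgroup[1:])][0]
        lexeme = m.group()
        # Greedy matching takes the longest identifier first; only then
        # do we check whether that whole lexeme is a reserved word.
        if category == "identifier" and lexeme in RESERVED:
            category = "reserved word"
        if category != "whitespace":
            tokens.append((lexeme, category))
        pos = m.end()
    return tokens

print(tokenize("int i = 20;"))
```

Because the identifier pattern is greedy, `integer` is scanned as one identifier rather than as the reserved word `int` followed by `eger`, which is the principle of longest substring in action.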
| formal language | defined by/generator | model of computation/recognizer |
| regular language (RL) | regular grammar (RG) | finite state automata (FSA) |
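The recognizer column of the table above can be made concrete with a hand-coded deterministic finite state automaton. The sketch below assumes three states (`start`, `number`, `identifier`) and a transition table `DELTA`; these names are hypothetical, and the automaton accepts exactly the two patterns from the lexical analyzer example, [1-9][0-9]* and [_a-zA-Z][_a-zA-Z0-9]*.

```python
import string

LETTERS = set(string.ascii_letters + "_")
NONZERO = set("123456789")

def char_class(c):
    """Map a character to the input symbol class used by the automaton."""
    if c in LETTERS:
        return "letter"
    if c in NONZERO:
        return "nonzero"
    if c == "0":
        return "zero"
    return "other"

# Transition function: (state, input class) -> next state.
# A missing entry means the automaton rejects the string.
DELTA = {
    ("start", "nonzero"): "number",
    ("start", "letter"): "identifier",
    ("number", "nonzero"): "number",
    ("number", "zero"): "number",
    ("identifier", "letter"): "identifier",
    ("identifier", "nonzero"): "identifier",
    ("identifier", "zero"): "identifier",
}

# Accepting states and the token category each one reports.
ACCEPTING = {"number": "positive number", "identifier": "identifier"}

def classify(s):
    """Run the automaton over s; return a category, or None if rejected."""
    state = "start"
    for c in s:
        state = DELTA.get((state, char_class(c)))
        if state is None:
            return None
    return ACCEPTING.get(state)

for sample in ["20", "_tmp1", "007", "9lives"]:
    print(sample, "->", classify(sample))
```

The automaton needs no memory beyond its current state, which is why it suffices for regular languages.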
Context-free grammars (Backus-Naur form) and context-free languages
Generating and Recognizing Recursive Descriptions of Patterns with Context-Free Grammars (courtesy Randal Nelson and Tom LeBlanc, University of Rochester)
- stream of tokens must conform to a grammar
(must be arranged in a particular order)
- grammars define how sentences are constructed
- defined using a metalanguage notation called Backus-Naur form (BNF)
- John Backus @ IBM for Algol 58 (1977 ACM A.M. Turing Award winner)
- Noam Chomsky
- Peter Naur for Algol 60 (2005 ACM A.M. Turing Award winner)
- simple grammar for English sentences
(r1) <sentence> ::= <article> <noun> <verb> <adverb> .
(r2) <article> ::= a | an | the
(r3) <noun> ::= dog | cat | Socrates
(r4) <verb> ::= runs | jumps
(r5) <adverb> ::= slow | fast
- elements:
- grammar = set of production rules,
- start symbol (<sentence>),
- non-terminals (<noun>),
- terminals (cat)
- another example:
(r1) <expr> ::= <expr> + <expr>
(r2) <expr> ::= <expr> * <expr>
(r3) <expr> ::= <id>
(r4) <id> ::= x | y | z
- EBNF: adds {}*, {}*(c), {}+, [ ], and ( | )
- {}* means 0 or more of enclosed
- {}+ means 1 or more of enclosed
- {<expression>}*(c) means 0 or more of enclosed, separated by c
- [ ] means enclosed is optional
- ( | ) alternation
- example:
<expr> ::= ( <list> )
<expr> ::= a
<list> ::= <expr>
<list> ::= <expr> <list>
- EBNF grammar which defines the same language
<expr> ::= ( <list> ) | a
<list> ::= <expr> [ <list> ]
- another example:
<term> ::= <factor> + <factor>
<factor> ::= <term>
- EBNF grammar which defines the same language
<term> ::= <factor> + <factor> {+ <factor>}*
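The elements of a grammar listed earlier in this section (production rules, start symbol, non-terminals, terminals) map directly onto a small data structure. The sketch below encodes the simple grammar for English sentences as a Python dictionary and enumerates every sentence it generates. The names `GRAMMAR`, `START`, and `expand` are hypothetical, and the exhaustive enumeration only terminates because this particular grammar is not recursive.

```python
from itertools import product

# Each non-terminal maps to its alternatives; each alternative is a
# sequence of symbols. Any symbol that is not a key is a terminal.
GRAMMAR = {
    "<sentence>": [["<article>", "<noun>", "<verb>", "<adverb>", "."]],
    "<article>": [["a"], ["an"], ["the"]],
    "<noun>": [["dog"], ["cat"], ["Socrates"]],
    "<verb>": [["runs"], ["jumps"]],
    "<adverb>": [["slow"], ["fast"]],
}
START = "<sentence>"

def expand(symbol):
    """Return every terminal string derivable from symbol.

    Only safe for non-recursive grammars (finite languages).
    """
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: derives only itself
    results = []
    for rhs in GRAMMAR[symbol]:
        parts = [expand(s) for s in rhs]
        for combo in product(*parts):
            results.append(" ".join(combo))
    return results

sentences = expand(START)
print(len(sentences))
print("the dog runs fast ." in sentences)
```

The language is finite here (3 articles, 3 nouns, 2 verbs, 2 adverbs), and it includes strings such as "an dog runs slow ." that are syntactically valid under this grammar even though they are not good English.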
Language generation and recognition
(syntactic analysis or parsing)
- what can we use grammars for?
- language generation
- apply the rules in a top-down fashion
- construct a derivation
- deriving sentences from the above grammar;
derive "the dog runs fast" (=> means `derives')
<sentence> => <article> <noun> <verb> <adverb> . (r1)
=> <article> <noun> <verb> fast . (r5)
=> <article> <noun> runs fast . (r4)
=> <article> dog runs fast . (r3)
=> the dog runs fast . (r2)
- grammar for simple arithmetic expressions for
a four-function calculator
(r1) <expr> ::= <expr> + <expr>
(r2) <expr> ::= <expr> - <expr>
(r3) <expr> ::= <expr> * <expr>
(r4) <expr> ::= <expr> / <expr>
(r5) <expr> ::= <id>
(r6) <id> ::= x | y | z
(r7) <expr> ::= (<expr>)
(r8) <expr> ::= <number>
(r9) <number> ::= <number> <digit>
(r10) <number> ::= <digit>
(r11) <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- there are leftmost and rightmost derivations
- some derivations are neither
- sample derivations of 234:
- leftmost derivation:
<expr> => <number> (r8)
=> <number> <digit> (r9)
=> <number> <digit> <digit> (r9)
=> <digit> <digit> <digit> (r10)
=> 2 <digit> <digit> (r11)
=> 23 <digit> (r11)
=> 234 (r11)
- rightmost derivation:
<expr> => <number> (r8)
=> <number> <digit> (r9)
=> <number> 4 (r11)
=> <number> <digit> 4 (r9)
=> <number> 34 (r11)
=> <digit> 34 (r10)
=> 234 (r11)
- neither rightmost nor leftmost derivation:
<expr> => <number> (r8)
=> <number> <digit> (r9)
=> <number> <digit> <digit> (r9)
=> <number> <digit> 4 (r11)
=> <number> 34 (r11)
=> <digit> 34 (r10)
=> 234 (r11)
- another neither rightmost nor leftmost derivation:
<expr> => <number> (r8)
=> <number> <digit> (r9)
=> <number> <digit> <digit> (r9)
=> <number> 3 <digit> (r11)
=> <digit> 3 <digit> (r10)
=> 23 <digit> (r11)
=> 234 (r11)
- derive "x + y * z"
<expr> => <expr> + <expr> (r1)
=> <expr> + <expr> * <expr> (r3)
=> <expr> + <expr> * <id> (r5)
=> <expr> + <expr> * z (r6)
=> <expr> + <id> * z (r5)
=> <expr> + y * z (r6)
=> <id> + y * z (r5)
=> x + y * z (r6)
- is a grammar a generative device or a recognition device?
both: that one formalism serves both purposes is
one of the seminal discoveries in computer science
- language recognition; do the reverse
generation: grammar → sentence
recognition: sentence → grammar
- let's parse x + y * z (do the reverse)
. x + y * z (shift)
x . + y * z (reduce r6)
<id> . + y * z (reduce r5)
<expr> . + y * z (shift)
<expr> + . y * z (shift)
<expr> + y . * z (reduce r6)
<expr> + <id> . * z (reduce r5)
<expr> + <expr> . * z (shift) ← why not reduce r1 here instead?
<expr> + <expr> * . z (shift)
<expr> + <expr> * z . (reduce r6)
<expr> + <expr> * <id> . (reduce r5)
<expr> + <expr> * <expr> . (reduce r3; emit multiplication)
<expr> + <expr> . (reduce r1; emit addition)
<expr> . (start symbol...hurray! this is a valid sentence)
- . (dot) denotes the top of the stack
- the rhs of the rule being reduced, as it appears on top of the stack, is called the handle
- called bottom-up or shift-reduce parsing
- construct a parse tree
the above parse exhibits a shift-reduce conflict
- if we shift, multiplication will have higher precedence (desired)
- if we reduce, addition will have higher precedence (undesired)
there is also a reduce-reduce conflict (though the above
parse does not have one); consider the following:
(r1) <expr> ::= <term>
(r2) <expr> ::= <id>
(r3) <term> ::= <id>
(r4) <id> ::= x | y | z
let's parse x
. x (shift)
x . (reduce r4)
<id> . ← reduce r2 or r3 here?
parse trees for x
the underlying source of the shift-reduce conflict and
the reduce-reduce conflict in these examples is an ambiguous grammar
| formal language | defined by/generator | model of computation/recognizer |
| regular language (RL) | regular grammar (RG) | finite state automata (FSA) |
| context-free language (CFL) | context-free grammar (CFG) | pushdown automata (PDA) |
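The shift-reduce conflict discussed above can be reproduced with a very small bottom-up parser. The sketch below is an illustration under simplifying assumptions, not a real LR parser: it handles only + and * over the identifiers x, y, and z, it folds the two identifier reductions into the shift, and the parameter `on_conflict` (a hypothetical name, as are `shift_reduce_parse`, `IDS`, and `OPS`) chooses what to do whenever both a shift and a reduce are possible.

```python
IDS = {"x", "y", "z"}
OPS = {"+", "*"}

def is_expr(item):
    """Stack items are operators, identifiers, or (op, left, right) trees."""
    return isinstance(item, tuple) or item in IDS

def shift_reduce_parse(tokens, on_conflict="shift"):
    """Parse tokens bottom-up; return a tree of nested (op, left, right) tuples."""
    for t in tokens:
        if t not in IDS and t not in OPS:
            raise ValueError(f"lexically invalid token {t!r}")
    stack = []
    rest = list(tokens)
    while True:
        # A handle <expr> op <expr> on top of the stack can be reduced.
        can_reduce = (
            len(stack) >= 3
            and stack[-2] in OPS
            and is_expr(stack[-1])
            and is_expr(stack[-3])
        )
        can_shift = bool(rest)
        if can_reduce and can_shift and rest[0] in OPS:
            action = on_conflict  # the shift-reduce conflict
        elif can_reduce:
            action = "reduce"
        elif can_shift:
            action = "shift"
        else:
            break
        if action == "shift":
            stack.append(rest.pop(0))
        else:
            right = stack.pop()
            op = stack.pop()
            left = stack.pop()
            stack.append((op, left, right))
    if len(stack) != 1 or not is_expr(stack[0]):
        raise ValueError("syntactically invalid sentence")
    return stack[0]

print(shift_reduce_parse("x + y * z".split(), on_conflict="shift"))
print(shift_reduce_parse("x + y * z".split(), on_conflict="reduce"))
```

Shifting at the conflict groups y * z first, while reducing groups x + y first. A fixed policy only happens to give the desired result for this input; a tool such as yacc shifts by default and otherwise consults declared operator precedence and associativity.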
Ambiguity
disambiguation is a mechanical process: take a compilers course
C still uses an ambiguous grammar, why?
rules get lengthy and impractical to implement
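A grammar is ambiguous when some sentence has more than one parse tree. To make that concrete, the sketch below enumerates every parse tree of a token list under the ambiguous rules <expr> ::= <expr> + <expr> | <expr> * <expr> | <id>. The function name `parse_trees` and the tuple representation of trees are assumptions made for illustration; the brute-force search is exponential and is not how a real parser works.

```python
IDS = {"x", "y", "z"}
OPS = {"+", "*"}

def parse_trees(tokens):
    """Return all parse trees for tokens as nested (op, left, right) tuples."""
    if len(tokens) == 1:
        # <expr> ::= <id>
        return [tokens[0]] if tokens[0] in IDS else []
    trees = []
    # <expr> ::= <expr> op <expr>: try every operator as the root.
    for i, tok in enumerate(tokens):
        if tok in OPS:
            for left in parse_trees(tokens[:i]):
                for right in parse_trees(tokens[i + 1:]):
                    trees.append((tok, left, right))
    return trees

for tree in parse_trees("x + y * z".split()):
    print(tree)
```

For x + y * z the search finds two trees, one that multiplies first and one that adds first, which is exactly the choice a parser faces at the shift-reduce conflict.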
Syntax analysis
- building up a parse tree
- or just simply checking for validity
- need not always actually build the tree;
sometimes a traversal is enough, especially if you are not going on to
semantic analysis or code generation
- a syntactic analyzer is called a parser
- yacc
(or bison):
a UNIX tool which takes a BNF grammar (in a .y file)
and generates a parser in C for the language it defines
- ambiguous grammar: small and leads to a fast parser, but is ambiguous
- unambiguous grammar: large and leads to a slow parser, but has no ambiguity
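One common way to remove the ambiguity is to add a non-terminal per precedence level. The sketch below assumes the EBNF grammar <expr> ::= <term> {+ <term>}*, <term> ::= <factor> {* <factor>}*, <factor> ::= <id> | ( <expr> ), which is a standard textbook rewrite and not a grammar given in these notes, and parses it by recursive descent (a top-down technique, unlike the bottom-up parsers yacc generates). The names `Parser` and `parse_expr` are hypothetical.

```python
IDS = {"x", "y", "z"}

class Parser:
    """Recursive-descent parser: one method per non-terminal."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        # <expr> ::= <term> {+ <term>}*   (left-associative)
        tree = self.term()
        while self.peek() == "+":
            self.advance()
            tree = ("+", tree, self.term())
        return tree

    def term(self):
        # <term> ::= <factor> {* <factor>}*   (binds tighter than +)
        tree = self.factor()
        while self.peek() == "*":
            self.advance()
            tree = ("*", tree, self.factor())
        return tree

    def factor(self):
        # <factor> ::= <id> | ( <expr> )
        tok = self.advance()
        if tok in IDS:
            return tok
        if tok == "(":
            tree = self.expr()
            if self.advance() != ")":
                raise ValueError("expected )")
            return tree
        raise ValueError(f"unexpected token {tok!r}")

def parse_expr(tokens):
    parser = Parser(tokens)
    tree = parser.expr()
    if parser.peek() is not None:
        raise ValueError(f"unexpected token {parser.peek()!r}")
    return tree

print(parse_expr("x + y * z".split()))
print(parse_expr("( x + y ) * z".split()))
```

Each sentence now has exactly one parse tree, at the cost of extra rules and extra non-terminals. If only validity is needed, the same methods could return nothing and simply raise on error, without building the tree.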
Context-sensitivity
- an example of a property that is not context-free, or
what is an example of something that is context-sensitive?
in this course we will not go beyond CFGs
is C a context-free or context-sensitive language (CSL)?
it is a CSL implemented with a CFG
solutions:
- use more powerful grammars (CSGs), or
- use attribute grammars (courtesy Knuth; 1974 ACM A.M.
Turing Award winner): CFGs decorated with rules
(see [COPL] pp. 130-136)
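In practice, a context-sensitive property such as declaring a variable before using it is usually checked in a separate pass after context-free parsing, with the help of a symbol table. The sketch below assumes the parser has already reduced a program to a list of ("declare", name) and ("use", name) pairs; that representation and the name `undeclared_uses` are hypothetical and chosen only for illustration.

```python
def undeclared_uses(statements):
    """Return (statement number, name) for each use that precedes its declaration."""
    declared = set()  # the symbol table: names declared so far
    errors = []
    for number, (kind, name) in enumerate(statements, start=1):
        if kind == "declare":
            declared.add(name)
        elif kind == "use" and name not in declared:
            errors.append((number, name))
    return errors

program = [
    ("declare", "i"),
    ("use", "i"),
    ("use", "j"),      # j is never declared
    ("use", "k"),      # k is declared too late
    ("declare", "k"),
]
print(undeclared_uses(program))
```

The check depends on what appeared earlier in the program, which is the kind of context a CFG alone cannot express.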
Chomsky hierarchy
(progressive classes of formal grammars)
- phrase structured (unrestricted) grammars
- generate recursively enumerable (unrestricted) languages
- include all formal grammars
- implemented with Turing machines
- context-sensitive grammars
- generate context-sensitive languages
- implemented with linear-bounded automata
- context-free grammars
- generate context-free languages
- single non-terminal on left
- non-terminals & terminals on right
- implemented with pushdown automata
- regular grammars
- generate regular languages
- implemented with finite state automata
| formal language | defined by/generator | model of computation/recognizer |
| regular language (RL) | regular grammar (RG) | finite state automata (FSA) |
| context-free language (CFL) | context-free grammar (CFG) | pushdown automata (PDA) |
| context-sensitive language (CSL) | context-sensitive grammar (CSG) | linear-bounded automata (LBA) |
| recursively-enumerable language (REL) | unrestricted grammar (UG) | Turing machine (TM) |
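Moving down one row in the table above, from finite state automata to pushdown automata, amounts to adding a stack. The sketch below shows that extra memory at work on non-empty strings of balanced parentheses. It is a simplified, deterministic illustration and not a formal PDA definition; the name `balanced` is hypothetical.

```python
def balanced(s):
    """Accept non-empty strings of balanced parentheses, e.g. (()) and ()()."""
    stack = []
    for c in s:
        if c == "(":
            stack.append(c)      # push: remember an unmatched open
        elif c == ")":
            if not stack:
                return False     # a close with nothing to match
            stack.pop()          # pop: match the most recent open
        else:
            return False         # symbol outside the alphabet
    # Accept only if something was read and every open was matched.
    return bool(s) and not stack

for sample in ["(())", "()()", "(()", ")("]:
    print(sample, balanced(sample))
```

The stack can grow without bound, so the recognizer is not limited to a fixed number of states the way a finite state automaton is.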
Exercises
- express .*hw.* as a CFG
- express <S> ::= () | <S><S> | (<S>) as a regular
grammar
- generates strings of balanced parentheses (no dangling parentheses)
- of critical importance to programming languages
- e.g., (()), ()()
- a CSG can express context (which a CFG cannot).
what can a CFG express that a regular grammar cannot?
(hint: exercise above gives some clues)
Constructs and capabilities
- regular grammars can capture rules for a valid identifier in C
- context-free grammars can capture rules
for a valid mathematical expression in C
- neither can capture the
fact that a variable must be declared before it is used
- can push semantic properties like
precedence and associativity into the grammar
- static semantics
- declaring a variable before its usage
- type compatibility
- can use attribute grammars
- dynamic semantics (in order from least to most formal)
- operational
- denotational
- axiomatic
References
| [CGLY] | T. Niemann. A Compact Guide to Lex and Yacc. ePaperPress. |
| [COPL] | R.W. Sebesta. Concepts of Programming Languages. Addison-Wesley, Boston, MA, Sixth edition, 2003. |
| [EOPL2] | D.P. Friedman, M. Wand, and C.T. Haynes. Essentials of Programming Languages. MIT Press, Cambridge, MA, Second edition, 2001. |
| [EOPL3] | D.P. Friedman and M. Wand. Essentials of Programming Languages. MIT Press, Cambridge, MA, Third edition, 2008. |
| [IALC] | J.E. Hopcroft, R. Motwani, and J.D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Boston, MA, Third edition, 2006. |
| [PLPP] | K.C. Louden. Programming Languages: Principles and Practice. Brooks/Cole, Pacific Grove, CA, Second edition, 2002. |
| [UPE] | B.W. Kernighan and R. Pike. The UNIX Programming Environment. Prentice Hall, Upper Saddle River, NJ, 1984. |