CPS 343/543, 444/544 Lecture notes: Formal languages and grammars



Coverage:
    (CPS 343/543): [EOPL2] §1.1 (pp. 1-9)
    (CPS 444/544):


Formal languages

  • what is a formal language?
    • set of strings (sentences) over some alphabet
    • legal strings = sentences
  • how do we define a formal language?
    • grammar (define syntax of a language)
    • any more?
  • syntax and semantics
    • syntax refers to structure of language
    • semantics refers to meaning of language
  • previously, both syntax and semantics
    • used to be described intuitively
    • now, well-defined, formal systems are available
  • there are finite and infinite languages
    • is C an infinite language?
    • most interesting languages are infinite


Progressive stages of sentence validity

  • preprocessing (purges comments)
  • lexical analysis
  • syntax analysis
  • semantic analysis
  • examples:
    sentence              lexically valid?   syntactically valid?   semantically valid?
    Socrates is a mann.   no                 -                      -
    Man Socrates is a.    yes                no                     -
    Man is a Socrates.    yes                yes                    no
    Socrates is a man.    yes                yes                    yes


Compilation example



(regenerated with minor modifications from [CGLY] Fig. 1, p. 4)


Execution through interpretation



(adapted version of [EOPL3] Fig. 3.1a, p. 59)


Execution through compilation



(adapted version of [EOPL3] Fig. 3.1b, p. 59)


Regular grammars, regular languages, and lexical analysis

Finite Automata and Regular Expressions (courtesy Randal Nelson and Tom LeBlanc, University of Rochester)
  • what does lexical analysis do?
    • parcel characters into lexemes
    • lexemes fall into token categories
  • example: parceling int i = 20; into lexemes

    lexeme   token
    int      reserved word
    i        identifier
    =        special symbol
    20       literal
    ;        special symbol

  • principle of longest substring [PLPP] p. 79
  • what delimiter did we use? whitespace?
    • free-format language: formatting has no effect on program structure [PLPP] p. 79
    • fixed-format language: formatting has an effect on program structure; early versions of FORTRAN were fixed format and ignored whitespace, so DO 99 I = 1.10 (i.e., the assignment DO99I = 1.10) is different from DO 99 I = 1, 10 (i.e., the loop for (I=1; I<=10; I++) in C)
    • others: Haskell and Python use layout-based (indentation) syntactic grouping
  • reserved words (cannot be used as a name (e.g., int in C)) vs. keywords (only special in certain contexts (e.g., main in C)) [COPL]
  • returns a stream of tokens (why stream of tokens and not stream of lexemes?)
  • lexemes can be formally described by regular grammars
    • . (any single character)
    • * (0 or more of previous character)
    • + (or, i.e., union of alternatives; note that POSIX regular expressions instead use | for union and + for `1 or more of previous')
    • shorthand notation:
      • [a-z] (one of the characters in this range)
      • [^a-z] (any character but one in this range)
    • examples:
      • hw* (defines a set of sentences = {h, hw, hww, hwww, hwwww, ...})
      • hw[1-9][0-9]* (defines a set of sentences = {hw1, hw2, ..., hw9, hw10, hw11, ...})
      • SSNs: [0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]
      • matching any phrase of exactly three words separated by white space:
        This is a short sentence.
        ---------
             ----------
                ----------------
        
        [^ ][^ ]*[ ][ ]*[^ ][^ ]*[ ][ ]*[^ ][^ ]*
        
  • regular grammars (i.e., left-linear or right-linear grammars) are generative devices for regular languages
  • regular grammars define regular languages
  • any finite language is regular
  • sentences from regular languages are recognized using finite state automata (FSA)
  • example of coding up a lexical analyzer
    • recognizing positive integers and identifiers in C: [1-9][0-9]* + [_a-zA-Z][_a-zA-Z0-9]*
    • lexical analyzer is called a scanner or a lexer
    • example finite state automaton
  • lex (or flex): a UNIX tool which takes a set of regular expressions (in a .l file) and generates a lexical analyzer in C for them; each call to yylex() retrieves the next token

  • formal language          defined by (generator)    model of computation (recognizer)
    regular language (RL)    regular grammar (RG)      finite state automaton (FSA)
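
  The token table above can be realized as a small scanner. The following sketch is hypothetical (Python's re module standing in for lex; the names TOKEN_SPEC and scan are invented for illustration): it parcels characters into lexemes and tags each with its token category.

    ```python
    import re

    # Scanner sketch (hypothetical) for the token categories tabulated above.
    # Note `|` here is union; the \b enforces the principle of longest
    # substring for reserved words (so `integer` scans as one identifier,
    # not as `int` followed by `eger`).
    TOKEN_SPEC = [
        ('reserved',   r'(?:int|char|while|if|return)\b'),
        ('identifier', r'[_a-zA-Z][_a-zA-Z0-9]*'),
        ('literal',    r'[1-9][0-9]*|0'),
        ('special',    r'[=;]'),
        ('skip',       r'\s+'),
    ]
    MASTER = re.compile('|'.join(f'(?P<{name}>{pat})' for name, pat in TOKEN_SPEC))

    def scan(source):
        """Parcel characters into lexemes; return a stream of (token, lexeme) pairs."""
        return [(m.lastgroup, m.group())
                for m in MASTER.finditer(source)
                if m.lastgroup != 'skip']
    ```

  scan('int i = 20;') yields the five (token, lexeme) pairs of the table above; a lex-generated scanner in C delivers the same stream one token per yylex() call.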


Context-free grammars (Backus-Naur form) and context-free languages

Generating and Recognizing Recursive Descriptions of Patterns with Context-Free Grammars (courtesy Randal Nelson and Tom LeBlanc, University of Rochester)
  • stream of tokens must conform to a grammar (must be arranged in a particular order)
  • grammars define how sentences are constructed
  • defined using a metalanguage notation called Backus-Naur form (BNF)
    • John Backus @ IBM for Algol 58 (1977 ACM A.M. Turing Award winner)
    • Noam Chomsky
    • Peter Naur for Algol 60 (2005 ACM A.M. Turing Award winner)
  • simple grammar for English sentences
    (r1) <sentence> ::= <article> <noun> <verb> <adverb> .
    (r2)  <article> ::= a | an | the
    (r3)     <noun> ::= dog | cat | Socrates
    (r4)     <verb> ::= runs | jumps
    (r5)   <adverb> ::= slow | fast
    
  • elements:
    • grammar = set of production rules,
    • start symbol (<sentence>),
    • non-terminals (<noun>),
    • terminals (cat)
  • another example:
    (r1) <expr> ::= <expr> + <expr>
    (r2) <expr> ::= <expr> * <expr>
    (r3) <expr> ::= <id>
    (r4)   <id> ::= x | y | z
    
  • EBNF: adds {}*, {}*(c), {}+, [ ], and ( | )
    • {}* means 0 or more of enclosed
    • {}+ means 1 or more of enclosed
    • {<expression>}*(c) means 0 or more of <expression>, separated by c
    • [ ] means enclosed is optional
    • ( | ) means alternation (choose one of the alternatives)
    • example:
      <expr> ::= ( <list> )
      <expr> ::= a
      <list> ::= <expr>
      <list> ::= <expr> <list>
      
    • EBNF grammar which defines the same language
       <expr> ::= ( <list> ) | a
       <list> ::= <expr> [ <list> ]
      
    • another example:
        <term> ::= <factor> + <factor>
      <factor> ::= <term>
      
    • EBNF grammar which defines the same language
      <term> ::= <factor> + <factor> {+ <factor>}*
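
    The first EBNF example above (<expr> ::= ( <list> ) | a) can be checked with a tiny recursive-descent recognizer; the EBNF option [ <list> ] and repetition map directly onto an if-test and a loop. A sketch (hypothetical function names, assuming single-character input symbols):

    ```python
    def accepts(s):
        """Recognizer sketch for  <expr> ::= ( <list> ) | a
                                  <list> ::= <expr> [ <list> ]"""
        pos = 0
        def expr():
            nonlocal pos
            if pos < len(s) and s[pos] == 'a':      # <expr> ::= a
                pos += 1
                return True
            if pos < len(s) and s[pos] == '(':      # <expr> ::= ( <list> )
                pos += 1
                if lst() and pos < len(s) and s[pos] == ')':
                    pos += 1
                    return True
            return False
        def lst():                                  # [ <list> ] becomes a loop:
            nonlocal pos                            # one or more <expr>s
            if not expr():
                return False
            while pos < len(s) and s[pos] in '(a':
                if not expr():
                    return False
            return True
        return expr() and pos == len(s)
    ```

    accepts('(aa)') and accepts('((a)a)') hold, while accepts('()') and accepts('(a') fail; each call consumes exactly the substring its non-terminal derives.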
      


Language generation and recognition (syntactic analysis or parsing)

  • what can we use grammars for?
  • language generation
    • apply the rules in a top-down fashion
    • construct a derivation
  • deriving sentences from the above grammar; derive "the dog runs fast" (=> means `derive')
    <sentence> => <article> <noun> <verb> <adverb> . (r1)
               => <article> <noun> <verb> fast .     (r5)
               => <article> <noun> runs fast .       (r4)
               => <article> dog runs fast .          (r3)
               => the dog runs fast .                (r2)
    
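  Rule application in a top-down derivation like the one above is easy to mechanize. The sketch below is hypothetical (the English grammar transcribed by hand into a Python table; the name generate is invented): it constructs a random derivation by expanding non-terminals until only terminals remain.

    ```python
    import random

    # The grammar above, as a table mapping each non-terminal to the
    # alternative right-hand sides of its production rules.
    GRAMMAR = {
        '<sentence>': [['<article>', '<noun>', '<verb>', '<adverb>', '.']],
        '<article>':  [['a'], ['an'], ['the']],
        '<noun>':     [['dog'], ['cat'], ['Socrates']],
        '<verb>':     [['runs'], ['jumps']],
        '<adverb>':   [['slow'], ['fast']],
    }

    def generate(symbol='<sentence>'):
        """Expand symbol top-down until only terminals remain."""
        if symbol not in GRAMMAR:                 # terminal: emit as-is
            return [symbol]
        rhs = random.choice(GRAMMAR[symbol])      # pick one production to apply
        out = []
        for s in rhs:
            out.extend(generate(s))
        return out
    ```

  ' '.join(generate()) might produce `the dog runs fast .`; every derivation in this grammar applies exactly five rules, matching the hand derivation above.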
  • grammar for a simple arithmetic expressions for a simple four-function calculator
    (r1)     <expr> ::= <expr> + <expr>
    (r2)     <expr> ::= <expr> - <expr>
    (r3)     <expr> ::= <expr> * <expr>
    (r4)     <expr> ::= <expr> / <expr>
    (r5)     <expr> ::= <id>
    (r6)       <id> ::= x | y | z
    (r7)     <expr> ::= (<expr>)
    (r8)     <expr> ::= <number>
    (r9)   <number> ::= <number> <digit>
    (r10)  <number> ::= <digit>
    (r11)   <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    
  • there are leftmost and rightmost derivations
  • some derivations are neither
  • sample derivations of 234:
    • leftmost derivation:
      <expr> => <number>                (r8)
             => <number> <digit>        (r9)
             => <number> <digit> <digit> (r9)
             => <digit> <digit> <digit> (r10)
             => 2 <digit> <digit>       (r11)
             => 23 <digit>              (r11)
             => 234                     (r11)
      
    • rightmost derivation:
      <expr> => <number>           (r8)
             => <number> <digit>   (r9)
             => <number> 4         (r11)
             => <number> <digit> 4 (r9)
             => <number> 34        (r11)
             => <digit> 34         (r10)
             => 234                (r11)
      
    • neither rightmost nor leftmost derivation:
      <expr> => <number>                (r8)
             => <number> <digit>        (r9)
             => <number> <digit> <digit> (r9)
             => <number> <digit> 4      (r11)
             => <number> 34             (r11)
             => <digit> 34              (r10)
             => 234                     (r11)
      
    • neither rightmost nor leftmost derivation:
      <expr> => <number>                (r8)
             => <number> <digit>        (r9)
             => <number> <digit> <digit> (r9)
             => <number> 3 <digit>      (r11)
             => <digit> 3 <digit>       (r10)
             => 23 <digit>              (r11)
             => 234                     (r11)
      
  • derive "x + y * z"
       <expr> => <expr> + <expr>          (r1)
              => <expr> + <expr> * <expr> (r3)
              => <expr> + <expr> * <id>   (r5)
              => <expr> + <expr> * z      (r6)
              => <expr> + <id> * z        (r5)
              => <expr> + y * z           (r6)
              => <id> + y * z             (r5)
              => x + y * z                (r6)
    
  • is a grammar a generative device or a recognition device? that the same grammar can serve as both was one of the seminal discoveries in computer science
  • language recognition; do the reverse
      generation: grammar → sentence
      recognition: sentence → grammar
  • let's parse x + y * z (do the reverse)
      . x + y * z (shift)
      x . + y * z (reduce r6)
      <id> . + y * z (reduce r5)
      <expr> . + y * z (shift)
      <expr> + . y * z (shift)
      <expr> + y . * z (reduce r6)
      <expr> + <id> . * z (reduce r5)
      <expr> + <expr> . * z (shift) ← why not reduce r1 here instead?
      <expr> + <expr> * . z (shift)
      <expr> + <expr> * z . (reduce r6)
      <expr> + <expr> * <id> . (reduce r5)
      <expr> + <expr> * <expr> . (reduce r3; emit multiplication)
      <expr> + <expr> . (reduce r1; emit addition)
      <expr> . (start symbol...hurray! this is a valid sentence)
    • . (dot) denotes the top of the stack
    • the rhs is called the handle
    • called bottom-up or shift-reduce parsing
    • construct a parse tree





  • the above parse exhibits a shift-reduce conflict
    • if we shift, multiplication will have higher precedence (desired)
    • if we reduce, addition will have higher precedence (undesired)

  • there is also a reduce-reduce conflict (though the above parse does not have one); consider the following:
    (r1)  <expr> ::= <term>
    (r2)  <expr> ::= <id>
    (r3)  <term> ::= <id>
    (r4)    <id> ::= x | y | z
    
    let's parse x
    . x         (reduce r4)
    <id> .   ← reduce r2 or r3 here?
    

    parse trees for x





  • here, the underlying source of both the shift-reduce conflict and the reduce-reduce conflict is an ambiguous grammar

  • formal language              defined by (generator)      model of computation (recognizer)
    regular language (RL)        regular grammar (RG)        finite state automaton (FSA)
    context-free language (CFL)  context-free grammar (CFG)  pushdown automaton (PDA)
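
The shift-or-reduce decision flagged above (`why not reduce r1 here instead?') can be driven by a precedence table. Below is a minimal operator-precedence sketch, not a full LR parser (the names PREC and evaluate are invented for illustration): before shifting an operator it reduces while the operator on the stack binds at least as tightly, which gives * priority over + and makes equal-precedence operators left-associative.

    ```python
    # Operator-precedence sketch (hypothetical): vals holds operands,
    # ops holds pending operators; reduce_once is the reduce step.
    PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

    def evaluate(tokens):
        vals, ops = [], []
        def reduce_once():
            b, a, op = vals.pop(), vals.pop(), ops.pop()
            vals.append({'+': a + b, '-': a - b,
                         '*': a * b, '/': a / b}[op])
        for tok in tokens:
            if tok in PREC:
                # reduce while the stacked operator binds at least as tightly;
                # >= (rather than >) makes equal precedence left-associative
                while ops and PREC[ops[-1]] >= PREC[tok]:
                    reduce_once()
                ops.append(tok)              # shift the operator
            else:
                vals.append(int(tok))        # shift an operand
        while ops:                           # reduce whatever remains
            reduce_once()
        return vals[0]
    ```

On 2 + 3 * 4 the parser shifts the * (the + on the stack binds less tightly), so the multiplication is reduced first; on 6 - 3 - 2 it reduces before shifting the second -, yielding (6-3)-2.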


Ambiguity

    sentence   derivations   parse trees   meaning(s)
    234        multiple      one           one (234)
    2+3+4      multiple      multiple      one (9)
    2+3*4      multiple      multiple      multiple (14 or 20)
    6-3-2      multiple      multiple      multiple (1 or 5)

  • last three cases make a grammar ambiguous; if a sentence from a language has more than one parse tree, then the grammar for the language is ambiguous
  • parse tree for "234"



  • parse trees for "2 + 3 + 4"





  • parse trees for "2 + 3 * 4"





  • parse trees for "6 - 3 - 2"





  • let's parse "Time flies like an arrow"
    • 4 different meanings!
    • we say that the grammar is ambiguous!
    • how can we determine intended meaning? need context
  • let's parse "I shot the man on the mountain with the camera."
  • ambiguous grammar
    • a grammar is ambiguous if you can construct at least two parse trees for the same sentence in the language
    • trivial to prove above grammar is ambiguous
  • how to prove a grammar is ambiguous (steps)
    1. generate an expression from the grammar and show the expression
    2. give two parse trees "using the grammar" for that expression
    notes:
    • the expression must come from the grammar
    • a parse tree is fully expanded; it has no leaves which are non-terminals; they are all terminals
    • collected leaves in each parse tree must constitute the expression
    • you cannot change the grammar while building the parse trees
    • you cannot change the expression while building the parse trees
  • we would like part of the meaning (or semantics) to be determined from the grammar (or syntax)
  • desideratum: syntax should imply semantics (a major complaint against systems like UNIX)
    • precedence
    • associativity
  • what does `have higher precedence' mean? occurs lower in the parse tree because expressions are evaluated bottom-up
  • solution
    • either state a disambiguating rule (order of precedence) (e.g., * has higher precedence than + (most languages, except APL)) or
    • (always possible to) revise the grammar
  • grammar revision
    • introduce new steps (non-terminals) in the non-terminal cascade so that multiplications are always lower than additions in the parse tree
    • worked out solution for 2+3*4
      (r1) <expr> ::= <expr> + <expr>
      (r2) <expr> ::= <expr> - <expr> 
      (r3) <expr> ::= <term>          
      (r4) <term> ::= <term> * <term>
      (r5) <term> ::= <term> / <term> 
      (r6) <term> ::= (<expr>)        
      (r7) <term> ::= <number>       
      
      • is this grammar still ambiguous? yes. why?
      • how can we disambiguate it?
  • associativity
    • comes into play when dealing with operators with same precedence
      • 6-3-2 = (6-3)-2 = 1 (left associative)
      • - - -6 = -(-(-6)) = -6 (right associative)
    • matters when adding floating-point numbers, or
    • with an operator such as subtraction (e.g., 6-3-2)
    • which operators in C are right-associative?
  • overcoming ambiguity of associativity
    • left-recursive leads to left associativity
    • right-recursive leads to right associativity
  • grammar still ambiguous for 2+3+4 and 6-3-2?
    • let's fix it and make it left-associative
      (r1)   <expr> ::= <expr> + <term>   
      (r2)   <expr> ::= <expr> - <term>  
      (r3)   <expr> ::= <term>            
      (r4)   <term> ::= <term> * <factor> 
      (r5)   <term> ::= <term> / <factor>
      (r6)   <term> ::= <factor>          
      (r7) <factor> ::= (<expr>)         
      (r8) <factor> ::= <number>        
      
    • theme: add another level of indirection by introducing a new non-terminal
    • notice rules get lengthy
    • why do we prefer a small rule set?
  • another example of ambiguity; (<term>)-<term> in C can mean two different things
    • subtracting: (10) - 2
    • typecasting: (int) - 3.5
  • classical example of grammar ambiguity in PLs: the dangling else problem
    • disambiguating grammars
    •    if (a < 2) then
            if (b > 3) then
              x
            else /* associates with which if above ? */
              y
      
    • ambiguous grammar
              <stmt> ::= if <cond> then <stmt>
              <stmt> ::= if <cond> then <stmt> else <stmt>
      
    • parse trees for "if (a < 2) then if (b > 3) then x else y"





    • exercise: develop an unambiguous version
  • disambiguation is a mechanical process: take a compilers course
  • C still uses an ambiguous grammar. why? fully disambiguated rules get lengthy and impractical to implement; in practice, parser generators resolve the remaining conflicts with precedence and associativity declarations instead
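
The revised, unambiguous expression grammar above (r1-r8, with the <expr>/<term>/<factor> cascade) can be implemented directly as a recursive-descent evaluator: each non-terminal becomes a function, and the left-recursive rules become EBNF-style loops that fold to the left. A sketch (hypothetical function names; tokenize is a stand-in lexer):

    ```python
    import re

    def tokenize(s):
        # split into numbers, operators, and parentheses (whitespace ignored)
        return re.findall(r'\d+|[()+\-*/]', s)

    def parse(tokens):
        pos = 0
        def peek():
            return tokens[pos] if pos < len(tokens) else None
        def expr():                      # <expr> ::= <term> {(+ | -) <term>}*
            nonlocal pos
            v = term()
            while peek() in ('+', '-'):
                op = tokens[pos]; pos += 1
                t = term()
                v = v + t if op == '+' else v - t   # left fold => left associativity
            return v
        def term():                      # <term> ::= <factor> {(* | /) <factor>}*
            nonlocal pos
            v = factor()
            while peek() in ('*', '/'):
                op = tokens[pos]; pos += 1
                f = factor()
                v = v * f if op == '*' else v / f
            return v
        def factor():                    # <factor> ::= ( <expr> ) | <number>
            nonlocal pos
            if peek() == '(':
                pos += 1
                v = expr()
                assert peek() == ')', 'missing )'
                pos += 1
                return v
            v = int(tokens[pos]); pos += 1
            return v
        return expr()
    ```

parse(tokenize('2+3*4')) evaluates the multiplication first because <term> sits below <expr> in the cascade (lower in the parse tree), and the loop in expr() folds 6-3-2 as (6-3)-2.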


Syntax analysis


Context-sensitivity

  • what is an example of a property that is not context-free (i.e., something that is context-sensitive)?
    • first letter of a sentence must be capitalized
      • Socrates is the boy.
      • The boy is Socrates.
    • an example context-sensitive grammar (CSG) for this:
              <beginning><article> ::= The | An | A
                         <article> ::= the | an | a
      
    • exercise: try expressing this as CFG (hint: it is possible)
    • others:
    • a variable must be declared before it is used
    • * operator in C
  • in this course we will not go beyond CFGs
  • is C a context-free or context-sensitive language (CSL)? it is a CSL implemented with a CFG
  • solutions:
    • use more powerful grammars (CSGs), or
    • use attribute grammars (courtesy Knuth; 1974 ACM A.M. Turing Award winner): CFGs decorated with rules (see [COPL] pp. 130-136)
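
A declare-before-use check cannot be expressed in the CFG itself, but it can be bolted on as a separate pass over the program. The toy sketch below is hypothetical (one statement per line, int-only declarations, whitespace-separated tokens; the name check_declared is invented): it tracks a set of declared names and flags uses that precede their declarations.

    ```python
    def check_declared(lines):
        """Return identifiers used before being declared (toy static-semantics check)."""
        declared = set()
        undeclared = []
        for line in lines:
            words = line.split()
            if words and words[0] == 'int':       # declaration: int <name>
                declared.add(words[1])
            else:                                 # use: flag unknown identifiers
                for w in words:
                    if w.isidentifier() and w not in declared:
                        undeclared.append(w)
        return undeclared
    ```

check_declared(['int x', 'x = y']) flags y; the context the CFG cannot carry (which names are in scope) lives in the declared set instead, much as an attribute grammar would thread it through the parse tree.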


Chomsky hierarchy

(progressive classes of formal grammars)


  • phrase structured (unrestricted) grammars
    • generate recursively enumerable (unrestricted) languages
    • include all formal grammars
    • implemented with Turing machines
  • context-sensitive grammars
    • generate context-sensitive languages
    • implemented with linear-bounded automata
  • context-free grammars
    • generate context-free languages
    • single non-terminal on left
    • non-terminals & terminals on right
    • implemented with pushdown automata
  • regular grammars
    • generate regular languages
    • implemented with finite state automata

  • formal language                       defined by (generator)           model of computation (recognizer)
    regular language (RL)                 regular grammar (RG)             finite state automaton (FSA)
    context-free language (CFL)           context-free grammar (CFG)       pushdown automaton (PDA)
    context-sensitive language (CSL)      context-sensitive grammar (CSG)  linear-bounded automaton (LBA)
    recursively-enumerable language (REL) unrestricted grammar (UG)        Turing machine (TM)


Exercises

  • express .*hw.* as a CFG
  • express <S> ::= () | <S><S> | (<S>) as a regular grammar
    • generates strings of balanced parentheses (no dangling parentheses)
    • of critical importance to programming languages
    • e.g., (()), ()()
  • a CSG can express context (which a CFG cannot). what can a CFG express that a regular grammar cannot? (hint: exercise above gives some clues)


Constructs and capabilities

  • regular grammars can capture rules for a valid identifier in C
  • context-free grammars can capture rules for a valid mathematical expression in C
  • neither can capture fact that a variable must be declared before it is used
  • can push semantic properties like precedence and associativity into the grammar
  • static semantics
    • declaring a variable before its usage
    • type compatibility
    • can use attribute grammars
  • dynamic semantics (in order from least to most formal)
    • operational
    • denotational
    • axiomatic


References

    [CGLY] T. Niemann. A Compact Guide to Lex and Yacc. ePaperPress.
    [COPL] R.W. Sebesta. Concepts of Programming Languages. Addison-Wesley, Boston, MA, Sixth edition, 2003.
    [EOPL2] D.P. Friedman, M. Wand, and C.T. Haynes. Essentials of Programming Languages. MIT Press, Cambridge, MA, Second edition, 2001.
    [EOPL3] D.P. Friedman and M. Wand. Essentials of Programming Languages. MIT Press, Cambridge, MA, Third edition, 2008.
    [IALC] J.E. Hopcroft, R. Motwani, and J.D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Boston, MA, Third edition, 2006.
    [PLPP] K.C. Louden. Programming Languages: Principles and Practice. Brooks/Cole, Pacific Grove, CA, Second edition, 2002.
    [UPE] B.W. Kernighan and R. Pike. The UNIX Programming Environment. Prentice Hall, Upper Saddle River, NJ, 1984.
