CPS 444/544 Lecture notes: yacc



Coverage: [UPE] Chapter 8, [CGLY], [CPL] §6.8 (pp. 147-149), §7.3 (pp. 155-156), §A8.3 (p. 212), and §B7 (p. 254)

Acknowledgment: Most of the material in these lecture notes comes from Tom Niemann's A Compact Guide to Lex and Yacc (ePaperPress).


Grammar warm up

    Is the following grammar ambiguous?
    E ::= E + E
    E ::= id
    
    How can we fix it?
    E ::= E + F | F
    F ::= id
    
    Now is it left- or right-associative? How about this?
    E ::= F + E | F
    F ::= id
    
    How can we disambiguate our running example?
    E ::= E + E
    E ::= E * E
    E ::= id
    
    How about?
    E ::= E + F | F
    F ::= F * G | G
    G ::= id
    
    Can we do it with only two non-terminals?
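
    The three-level grammar above (E, F, G) can be sketched as a small hand-written evaluator in C. This is only an illustration under stated assumptions: unsigned integers stand in for the terminal id, the names cursor, parse_E, parse_F, parse_G, and evaluate are hypothetical, and each left-recursive rule is written as a loop (which keeps the grouping left-to-right). It is not how yacc parses; it only shows that placing * one level below + makes * bind more tightly.

```c
#include <ctype.h>

/* Hypothetical helper names; integers stand in for the terminal id. */
static const char *cursor;   /* next unread character of the input */

static void skip_blanks(void) {
    while (*cursor == ' ')
        cursor++;
}

/* G ::= id   (here: an unsigned decimal integer) */
static int parse_G(void) {
    int value = 0;
    skip_blanks();
    while (isdigit((unsigned char)*cursor))
        value = value * 10 + (*cursor++ - '0');
    return value;
}

/* F ::= F * G | G   (left recursion written as a loop) */
static int parse_F(void) {
    int value = parse_G();
    for (skip_blanks(); *cursor == '*'; skip_blanks()) {
        cursor++;                 /* consume '*' */
        value *= parse_G();
    }
    return value;
}

/* E ::= E + F | F   (left recursion written as a loop) */
static int parse_E(void) {
    int value = parse_F();
    for (skip_blanks(); *cursor == '+'; skip_blanks()) {
        cursor++;                 /* consume '+' */
        value += parse_F();
    }
    return value;
}

/* Evaluates a string such as "2 + 3 * 4" according to the grammar. */
int evaluate(const char *text) {
    cursor = text;
    return parse_E();
}
```

    Calling evaluate("2 + 3 * 4") groups the input as 2 + (3 * 4), because the product is completed inside parse_F before parse_E performs the addition.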


yacc conflicts

both usually indicate an ambiguous grammar (or one that is not LALR(1))
  • shift-reduce conflict: by default, yacc shifts
  • reduce-reduce conflict: by default, yacc reduces by the rule listed first in the grammar
remedies
  • disambiguate the grammar by rewriting it (the best option)
  • choose a default action by declaring the precedence and associativity of operators as follows:
    %left '+' '-'
    %left '*' '/'
    
    where precedence increases from top to bottom: operators declared on later lines bind more tightly (i.e., * and / have higher precedence than + and - above)


Essential yacc

  • the parsing technique used by yacc is LALR(1), or Look-Ahead LR: the input is scanned Left to right and a Rightmost derivation is constructed in reverse; (1) indicates that the lookahead is limited to one token

  • how do lex and yacc communicate? through a global variable named yylval.

  • yacc maintains two internal stacks
    • parse stack
      • contains terminals and non-terminals
      • terminals (tokens) are returned by int yylex()
    • value stack
      • contains values of type YYSTYPE (int by default)
      • YYSTYPE is defined in <filename>.tab.h
      • pushed from variable yylval
      • $$ (top of stack after the reduction takes place), $1, $2, $3, ... $n reference items on the value stack corresponding to the items (from left to right) on the rhs of the production rule used in the reduction
    • these two stacks must always be synchronized

    %token INTEGER
    /* produces "#define INTEGER 258" in calc.tab.h (and calc.tab.c) on our system 
       because values 0-255 are reserved for character values, and
       yacc reserves a few more values (0 for end-of-file, 256 for the error token) and,
       therefore, token values typically start around 258 */
    
  • how to get values of different types on the value stack? use a union
  • 3rd generation languages (e.g., C) and 4th generation languages (yacc)
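
The interplay of yylex(), yylval, and the two stacks described above can be sketched in plain C. This is a hand-rolled illustration for the tiny grammar E ::= E '+' INTEGER | INTEGER, not the table-driven LALR(1) code that yacc actually generates. The names input, EXPR, STACK_MAX, and parse_sum are hypothetical, and yylex() here is a hand-written stand-in for the lex-generated scanner.

```c
#include <ctype.h>

#define INTEGER   258        /* token code, as in the %token comment above */
#define EXPR      1000       /* hypothetical code for the non-terminal E */
#define STACK_MAX 64

static const char *input;    /* hypothetical input cursor */
static int yylval;           /* value of the most recent token */

/* Hand-written stand-in for the lex-generated yylex(). */
static int yylex(void) {
    while (*input == ' ')
        input++;
    if (*input == '\0')
        return 0;                              /* 0 signals end of input */
    if (isdigit((unsigned char)*input)) {
        yylval = 0;
        while (isdigit((unsigned char)*input))
            yylval = yylval * 10 + (*input++ - '0');
        return INTEGER;
    }
    return *input++;                           /* e.g. '+' is its own code */
}

/* Parses  E ::= E '+' INTEGER | INTEGER  and returns the value of E,
   or -1 on a syntax error. */
int parse_sum(const char *text) {
    int symbols[STACK_MAX];  /* parse stack */
    int values[STACK_MAX];   /* value stack, kept in step with symbols */
    int top = 0;             /* number of entries on both stacks */
    int token;

    input = text;
    while ((token = yylex()) != 0) {
        if (top == STACK_MAX)
            return -1;
        symbols[top] = token;                  /* shift the token ... */
        values[top] = yylval;                  /* ... and its value together */
        top++;

        if (top >= 3 && symbols[top - 3] == EXPR &&
            symbols[top - 2] == '+' && symbols[top - 1] == INTEGER) {
            /* reduce E ::= E '+' INTEGER   { $$ = $1 + $3; } */
            int result = values[top - 3] + values[top - 1];
            top -= 3;                          /* pop the rhs from both stacks */
            symbols[top] = EXPR;               /* push the lhs ... */
            values[top] = result;              /* ... and $$ */
            top++;
        } else if (top == 1 && symbols[0] == INTEGER) {
            /* reduce E ::= INTEGER   { $$ = $1; } (value stays in place) */
            symbols[0] = EXPR;
        }
    }
    return (top == 1 && symbols[0] == EXPR) ? values[0] : -1;
}
```

Every shift and every reduction touches both arrays at the same index, which is what keeping the two stacks synchronized means in practice. For example, parse_sum("1 + 2 + 3") reduces twice with the first rule and leaves a single E on the stack.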


Marriage of lex and yacc

(adapted version of [CGLY] Fig. 2, p. 5)


How to run yacc (in conjunction with lex) to automatically generate a parser

$ flex tokens.l # produces lex.yy.c
$ bison -d gram.y # produces gram.tab.c and gram.tab.h
$ gcc -c gram.tab.c # produces gram.tab.o
$ gcc -c lex.yy.c # produces lex.yy.o
$ gcc -o parser gram.tab.o lex.yy.o # produces parser
$ ./parser < ...


Evaluating arithmetic expressions

  • expr
    $ expr 2 + 3
    5
    $ expr 2 + 3 \* 4
    14
    $ expr 2 \* 3 + 4
    10
    $ expr "2 + 3 * 4"
    2 + 3 * 4
    
  • bc -l (an arbitrary precision calculator)
    23+47
    70
    2 + 3
    5
    2 + 3 * 4
    14
    2 * 3 + 4
    10
    2 ^ 3
    8
    ^D
    


Makefile for simple calculator (version 1)

    SRC = calc
    CC = gcc
    LEX = flex
    LEX_FLAGS = -d
    YACC = bison
    YACC_FLAGS = -d -t
    
    all: $(SRC)
    
    $(SRC): lex.yy.o $(SRC).tab.o
            $(CC) lex.yy.o $(SRC).tab.o -o $(SRC)
    
    lex.yy.o: lex.yy.c $(SRC).tab.h
            $(CC) -c lex.yy.c
    
    lex.yy.c: $(SRC).l
            $(LEX) $(LEX_FLAGS) $(SRC).l
    
    $(SRC).tab.o: $(SRC).tab.c
            $(CC) -c $(SRC).tab.c
    
    $(SRC).tab.c: $(SRC).y
            $(YACC) $(YACC_FLAGS) $(SRC).y
    
    $(SRC).tab.h: $(SRC).y
            $(YACC) $(YACC_FLAGS) $(SRC).y
    
    clean:
            -rm *.[cho] $(SRC)
    


Programming exercise

Use lex and yacc to generate a parser for the language defined by the following grammar (akin to the parser we generated in class for the balanced, nested parentheses language).
<sentence> ::= <sentence> <expr>
    <expr> ::= ( <list> ) | a
    <list> ::= <expr> | <expr> <list>
We said in class that grammars are also generative devices. Write a C program which utilizes this grammar to generate n sentences from the language. Use those sentences to evaluate the correctness of your parser.


Enhancements to the simple calculator (version 2)

  • multiplication and division: demonstrates setting precedence
  • parentheses to make precedence explicit
  • exponentiation operator
    • has highest precedence
    • right associative
  • unary minus
    • shares highest precedence with the exponentiation operator
    • requires the %prec directive to disambiguate it from binary minus
  • single character variables
    • requires building and indexing a symbol table
    • symbol table is indexed by ints
      /* yields an integer in the range 0-25 */
      /* ascii code for character 'a' is 97 */
      /* ascii code for character 't' is 116 */
      yylval = *yytext - 'a';
      
    • requires a <statement> non-terminal
    • requires an assignment operator which is right associative and has lowest precedence
  • print statement


Some conceptual exercises

  • will the '-' expr %prec '^' { $$ = $2*-1; } rule interfere with parsing the print 2 ^ -3; string?
  • what is the difference between '-' expr %prec '^' { $$ = $2*-1; } and '-' expr %prec UMINUS { $$ = $2*-1; }?
  • why is the string print -4 - 5 parsed correctly if unary minus has the highest precedence?
  • when either works just as well, in yacc (bottom-up parsing), which is preferable: a left- or right-recursive grammar?
  • when either works just as well, in top-down parsing, which is preferable: a left- or right-recursive grammar?


unions

  • see [CPL] §6.8 (pp. 147-149) and §A8.3 (p. 212)
  • "A union is a variable that may hold (at different times) objects of different types and sizes, with the compiler keeping track of size and alignment requirements" [CPL] (p. 147).
  • "Unions provide a way to manipulate different kinds of data in a single area of storage, without embedding any machine-dependent information in the program" [CPL] (p. 147).
  • a union variable is large enough to hold the largest of its member types
  • a union is the ideal variable to use for the node in a syntax tree data structure
  • union {
        int i;
        float f;
        char s[16];
    } u;
    
  • a union is the C analog of a variant record in Pascal
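
Because a union does not record which member was stored last, it is commonly paired with a tag. The following is a minimal sketch of that idiom, assuming hypothetical names (Kind, Value, make_string, describe); it reuses the three member types from the union above.

```c
#include <stdio.h>
#include <string.h>

typedef enum { KIND_INT, KIND_FLOAT, KIND_STRING } Kind;  /* hypothetical tag */

typedef struct {
    Kind kind;           /* records which member of u is currently valid */
    union {
        int   i;
        float f;
        char  s[16];
    } u;                 /* large enough for its largest member, s */
} Value;

/* Builds a string-valued Value, truncating text to fit in s. */
Value make_string(const char *text) {
    Value v;
    v.kind = KIND_STRING;
    strncpy(v.u.s, text, sizeof v.u.s - 1);
    v.u.s[sizeof v.u.s - 1] = '\0';
    return v;
}

/* Formats v into buf according to its tag and returns buf. */
const char *describe(const Value *v, char *buf, size_t size) {
    switch (v->kind) {
    case KIND_INT:    snprintf(buf, size, "int %d", v->u.i);       break;
    case KIND_FLOAT:  snprintf(buf, size, "float %.1f", v->u.f);   break;
    case KIND_STRING: snprintf(buf, size, "string %s", v->u.s);    break;
    }
    return buf;
}
```

The tag plays the role that the variant selector plays in a Pascal variant record: the program, not the compiler, is responsible for reading only the member that was last written.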


Variable argument lists

  • see [CPL] §7.3 (pp. 155-156) and §B7 (p. 254)
  • use ellipses in prototype/header
    • int printf (char* fmt, ...)
    • means number and type may vary across calls
    • ellipses must come at the end
  • to step through arguments
    • declare variable of type va_list in function (e.g., va_list ap;)
    • initialize ap with va_start, passing the last named parameter (e.g., va_start (ap, fmt);)
    • then call va_arg with ap and a datatype to retrieve a value of that type (e.g., va_arg (ap, int);)
    • call va_end with ap after all arguments have been processed, but before the function returns (e.g., va_end (ap);)
    • if the argument types vary, use a switch to select the type passed to each va_arg call
  • typically pass the number of variable arguments as a parameter to the function
  • the necessary type (va_list) and macros (va_start, va_arg, and va_end) are declared in stdarg.h
  • another use: operator node in a parse tree (with a variable number of operands) (see [CGLY])
  • #include <stdarg.h>

    void f(int nargs, ...) {
    /* the declaration ... can only appear at the end of an argument list */
    
       int i, tmp;
       
       va_list ap;                /* argument pointer */
    
       va_start(ap, nargs);       /* initializes ap to point to the first unnamed argument;
                                     va_start must be called once before ap can be used */
    
       for (i=0; i < nargs; i++) {
          tmp = va_arg(ap, int);  /* returns one argument and steps ap to the next argument */
                                  /* the second argument to va_arg must be a type 
                                     name so that va_arg knows how big a step to take */
          (void) tmp;             /* the value is not used in this skeleton */
       }
    
       va_end(ap);                /* clean-up; must be called before function returns */ 
    }
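
When the argument types vary, the function needs some way to learn each type before calling va_arg. One possible sketch, assuming a hypothetical helper sum_mixed that takes a string of type codes ('i' for int, 'd' for double), uses a switch to choose the type name passed to va_arg:

```c
#include <stdarg.h>

/* Hypothetical helper: types holds one code per variable argument,
   'i' for int and 'd' for double. Returns the sum as a double. */
double sum_mixed(const char *types, ...) {
    va_list ap;
    double total = 0.0;
    int ok = 1;

    va_start(ap, types);          /* types is the last named parameter */
    for (; ok && *types != '\0'; types++) {
        switch (*types) {
        case 'i':
            total += va_arg(ap, int);
            break;
        case 'd':                 /* float arguments are promoted to double */
            total += va_arg(ap, double);
            break;
        default:                  /* unknown code: size unknown, so stop */
            ok = 0;
            break;
        }
    }
    va_end(ap);
    return total;
}
```

A call such as sum_mixed("idi", 1, 2.5, 3) steps through an int, a double, and an int in that order. Passing a code string that does not match the actual arguments is undefined behavior, which is why printf-style functions depend on a correct format string.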
    


Precedence and associativity in version 3

    /* value stack will be an array of these YYSTYPE's;
       has nothing to do with the union in calc.h */
    %union {
       int iValue;       /* integer value */
       char sIndex;      /* symbol table index */
       nodeType* nPtr;   /* node pointer */
    };
    /* generates the following:
    
       typedef union {
          int iValue;
          char sIndex;
          nodeType* nPtr;
       } YYSTYPE;
       extern YYSTYPE yylval; 
    
       in other words, constants, variables, and nodes can
       be represented by yylval in the parser's value stack
    
       binds INTEGER to iValue in the YYSTYPE union
       associates token names with correct component of the YYSTYPE union
       to generate following code (where yyval holds $$)
       yyval.nPtr = con(yyvsp[0].iValue); */
    
    %token <iValue> INTEGER
    %token <sIndex> VARIABLE
    %token WHILE IF PRINT
    %nonassoc IFX
    %nonassoc ELSE
    
    %left GE LE EQ NE '>' '<'
    %left '+' '-'
    %left '*' '/'
    %right '^'
    %nonassoc UMINUS
    
    /* binds expr to nPtr in the YYSTYPE union */
    %type <nPtr> stmt expr stmt_list
    
    As a language grows in size and complexity, relying on precedence and associativity declarations such as these does not scale. It is always preferable to disambiguate the grammar itself.
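
    On the scanner side, the %union and %token <...> declarations above determine which member of yylval the lexer should fill in for each token. The following C sketch imitates that with a hand-written yylex(). The token codes, the set_input helper, and the use of void* in place of nodeType* are assumptions made so that the block is self-contained; a real scanner would be generated by lex and would include calc.tab.h.

```c
#include <ctype.h>

/* Stand-ins for what would come from calc.tab.h; codes are assumed. */
#define INTEGER  258
#define VARIABLE 259

typedef union {
    int   iValue;    /* integer value */
    char  sIndex;    /* symbol table index */
    void *nPtr;      /* node pointer (nodeType* in the notes; void* here) */
} YYSTYPE;

static YYSTYPE yylval;
static const char *input;    /* hypothetical input cursor */

void set_input(const char *text) {
    input = text;
}

/* Hand-written stand-in for the lex-generated scanner. */
int yylex(void) {
    while (*input == ' ')
        input++;
    if (*input == '\0')
        return 0;                               /* end of input */
    if (isdigit((unsigned char)*input)) {
        yylval.iValue = 0;                      /* INTEGER uses iValue */
        while (isdigit((unsigned char)*input))
            yylval.iValue = yylval.iValue * 10 + (*input++ - '0');
        return INTEGER;
    }
    if (islower((unsigned char)*input)) {
        yylval.sIndex = (char)(*input++ - 'a'); /* VARIABLE uses sIndex */
        return VARIABLE;
    }
    return *input++;                            /* operators are their own code */
}
```

    After set_input("t = 42"), successive calls to yylex() return VARIABLE (with yylval.sIndex set), '=', INTEGER (with yylval.iValue set), and finally 0.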


structures for parse tree nodes










Dependency graph for calculator (version 3)


Makefile for calculator language (version 3)

    SRC = calc
    CC = gcc
    LEX = flex
    LEX_FLAGS =
    YACC = bison
    YACC_FLAGS = -d -t
    
    all: interpreter compiler parsetree
    
    interpreter: lex.yy.o $(SRC).tab.o interpreter.o
            $(CC) lex.yy.o $(SRC).tab.o interpreter.o -lm -o interpreter
    
    compiler: lex.yy.o $(SRC).tab.o compiler.o
            $(CC) lex.yy.o $(SRC).tab.o compiler.o -o compiler
    
    parsetree: lex.yy.o $(SRC).tab.o parsetree.o
            $(CC) lex.yy.o $(SRC).tab.o parsetree.o -o parsetree
    
    lex.yy.o: lex.yy.c $(SRC).tab.h $(SRC).h
            $(CC) -c lex.yy.c
    
    lex.yy.c: $(SRC).l
            $(LEX) $(LEX_FLAGS) $(SRC).l
    
    $(SRC).tab.o: $(SRC).tab.c $(SRC).h
            $(CC) -c $(SRC).tab.c
    
    $(SRC).tab.c: $(SRC).y
            $(YACC) $(YACC_FLAGS) $(SRC).y
    
    $(SRC).tab.h: $(SRC).y
            $(YACC) $(YACC_FLAGS) $(SRC).y
    
    interpreter.o: interpreter.c $(SRC).h $(SRC).tab.h
            $(CC) -c interpreter.c
    
    compiler.o: compiler.c $(SRC).h $(SRC).tab.h
            $(CC) -c compiler.c
    
    parsetree.o: parsetree.c $(SRC).h $(SRC).tab.h
            $(CC) -c parsetree.c
    
    clean:
            -rm *.o $(SRC).tab.h $(SRC).tab.c lex.yy.c interpreter compiler parsetree
    


Memory management questions to ponder

  • why is oprNodeType the last field of the union? will this approach work if it is not the last field?
  • are there any other approaches we can take to laying out the memory for these nodes of the parse tree? how about a union of structs? what are the implications?
  • moral of the story: since C is the lowest high-level language (with little type checking), we can manipulate the compiler into laying out memory in an advantageous way based on how we organize/overlap our memory structures in the program code
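
One related technique for variable-sized nodes is the C99 flexible array member, which must be the last field of its struct so that extra storage allocated past the end of the struct belongs to it. The sketch below is not the layout used in calc.h; the Node type and the functions con, opr, eval, and free_tree are hypothetical names chosen for illustration, and opr also shows a variable argument list being used to collect the operands.

```c
#include <stdarg.h>
#include <stdlib.h>

typedef struct Node {
    int oper;             /* operator code such as '+'; 0 marks a constant */
    int value;            /* constant value (used when oper is 0) */
    int nops;             /* number of operands stored in op */
    struct Node *op[];    /* flexible array member: must be the last field */
} Node;

/* Creates a constant leaf; returns NULL if out of memory. */
Node *con(int value) {
    Node *p = malloc(sizeof *p);
    if (p == NULL)
        return NULL;
    p->oper = 0;
    p->value = value;
    p->nops = 0;
    return p;
}

/* Creates an operator node with nops operands, each passed as a Node *. */
Node *opr(int oper, int nops, ...) {
    va_list ap;
    /* allocate the fixed part plus room for nops operand pointers */
    Node *p = malloc(sizeof *p + (size_t)nops * sizeof p->op[0]);
    if (p == NULL)
        return NULL;
    p->oper = oper;
    p->value = 0;
    p->nops = nops;
    va_start(ap, nops);
    for (int i = 0; i < nops; i++)
        p->op[i] = va_arg(ap, Node *);
    va_end(ap);
    return p;
}

/* Evaluates a tree built from con and binary '+' and '*' nodes. */
int eval(const Node *p) {
    if (p == NULL)
        return 0;
    switch (p->oper) {
    case 0:   return p->value;
    case '+': return eval(p->op[0]) + eval(p->op[1]);
    case '*': return eval(p->op[0]) * eval(p->op[1]);
    default:  return 0;   /* unknown operator in this sketch */
    }
}

/* Releases a node and all of its operands. */
void free_tree(Node *p) {
    if (p == NULL)
        return;
    for (int i = 0; i < p->nops; i++)
        free_tree(p->op[i]);
    free(p);
}
```

For example, opr('+', 2, con(2), opr('*', 2, con(3), con(4))) builds the tree for 2 + 3 * 4. If op were not the last field, the extra pointers written past the declared size would overlap whatever fields followed it.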


Things to do (version 4 and 5)

  • add a do { ... } while ( ... ); loop
  • consolidate the con and id functions into one function
  • consolidate the conNodeType and idNodeType types into a single type con_or_idNodeType (call this version 4)
  • then remove the con_or_idNodeType entirely and simply use an int in the nodeType struct (call this version 5)


References

    [CGLY] T. Niemann. A Compact Guide to Lex and Yacc. ePaperPress.
    [CPL] B.W. Kernighan and D.M. Ritchie. The C Programming Language. Prentice Hall, Upper Saddle River, NJ, Second edition, 1988.
    [UPE] B.W. Kernighan and R. Pike. The UNIX Programming Environment. Prentice Hall, Upper Saddle River, NJ, 1984.
