CPS 444/544 Lecture notes: yacc
Coverage: [UPE] Chapter 8,
[CGLY],
[CPL] §6.8 (pp. 147-149), §7.3 (pp. 155-156),
§A8.3 (p. 212), and §B7 (p. 254)
Grammar warm up
Is the following grammar ambiguous?
E ::= E + E
E ::= id
How can we fix it?
E ::= E + F | F
F ::= id
Now is it left- or right-associative?
How about this?
E ::= F - E | F
F ::= id
How can we fix it?
E ::= E - F | F
F ::= id
How can we disambiguate our running example?
E ::= E + E
E ::= E * E
E ::= id
How about?
E ::= E + F | F
F ::= F * G | G
G ::= id
Can we do it with only two non-terminals?
yacc conflicts
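both are usually caused by an ambiguous grammar (an unambiguous grammar that is not LALR(1) can also trigger them)
- shift-reduce conflict: by default, yacc will shift
- reduce-reduce conflict:
by default, yacc will reduce using the rule listed first in the grammar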
remedies
Essential yacc
- the parsing technique used by yacc is LALR(1), or
Look-Ahead LR: L for a left-to-right scan of the input, R for a
rightmost derivation constructed in reverse; (1) indicates
that the lookahead is limited to one token
- how do lex and yacc
communicate? through a global variable named yylval.
- yacc maintains two internal stacks
- parse stack
- contains terminals and non-terminals
- returned by int yylex()
- value stack
- contains values of type YYSTYPE (int by default)
- YYSTYPE is #defined in
y.tab.h
- pushed from variable yylval
- $$ (top of stack after the reduction takes place),
$1, $2, $3, ... $n reference items on
the value stack corresponding to the items (from left to right)
on the rhs of the production rule used in the reduction
- these two stacks must always be synchronized
%token INTEGER
/* produces "#define INTEGER 258" in calc.tab.c on our system
because values 0-255 are reserved for character values, and
yacc reserves several values for end-of-file and error processing and,
therefore, token values typically start around 258 */
- how to get values of different types on the value stack? use a union
Marriage of lex and yacc
(adapted version of [CGLY] Fig. 2, p. 5)
How to run yacc
(to generate the parser)
$ flex tokens.l # produces lex.yy.c
$ bison -d gram.y # produces gram.tab.c and gram.tab.h
$ gcc -c gram.tab.c # produces gram.tab.o
$ gcc -c lex.yy.c # produces lex.yy.o
$ gcc -o parser gram.tab.o lex.yy.o # produces parser
$ ./parser < ...
Evaluating arithmetic expressions
Makefile for simple calculator (version 1)
SRC = calc
CC = gcc
LEX = flex
LEX_FLAGS = -d
YACC = bison
YACC_FLAGS = -d -t
all: $(SRC)
$(SRC): lex.yy.o $(SRC).tab.o
	$(CC) lex.yy.o $(SRC).tab.o -o $(SRC)
lex.yy.o: lex.yy.c $(SRC).tab.h
	$(CC) -c lex.yy.c
lex.yy.c: $(SRC).l
	$(LEX) $(LEX_FLAGS) $(SRC).l
$(SRC).tab.o: $(SRC).tab.c
	$(CC) -c $(SRC).tab.c
$(SRC).tab.c: $(SRC).y
	$(YACC) $(YACC_FLAGS) $(SRC).y
$(SRC).tab.h: $(SRC).y
	$(YACC) $(YACC_FLAGS) $(SRC).y
clean:
	-rm *.[cho] $(SRC)
Exercise
Use lex and yacc to generate a parser
for the language defined by the following grammar (akin
to the parser we generated in class for the balanced, nested
parentheses language).
<sentence> ::= <sentence> <expr>
<expr> ::= ( <list> ) | a
<list> ::= <expr> | <expr> <list>
We said in class that grammars are also generative devices. Write a
C program which utilizes the grammar to generate n sentences
from the language. Use those sentences to evaluate the correctness of
your parser.
Enhancements to the simple calculator
(version 2)
- multiplication and division: demonstrates setting precedence
- parentheses to make precedence explicit
- exponentiation operator
- has highest precedence
- right associative
- unary minus
- shares highest precedence with the exponentiation operator
- requires %prec directive
to disambiguate
- single character variables
- print statement
Other things to know
- will the '-' expr %prec '^' { $$ = $2*-1; } rule interfere
with parsing the print 2 ^ -3; string?
- why is the string print -4 - 5 parsed correctly if
unary minus has the highest precedence?
- when either works just as well, in yacc (bottom-up parsing),
which is preferable: a left- or right-recursive grammar?
- when either works just as well, in top-down parsing,
which is preferable: a left- or right-recursive grammar?
- 3rd generation languages (e.g., C) and 4th generation languages
(yacc)
unions
- see [CPL] §6.8 (pp. 147-149) and §A8.3 (p. 212)
- `A union is a variable that may hold (at different times)
objects of different types and sizes, with the compiler keeping track
of size and alignment requirements' [CPL] (p. 147).
- `Unions provide a way to manipulate different kinds of data in a single
area of storage, without embedding any machine-dependent information in
the program' [CPL] (p. 147).
- a union variable is large enough to hold the largest of its member types
- a union is the ideal variable to use for the node in a syntax tree
data structure
union {
    int i;
    float f;
    char s[16];
} u;
- a union is the C analog of a variant record in Pascal
Variable argument lists
- see [CPL] §7.3 (pp. 155-156) and §B7 (p. 254)
- use ellipses in prototype/header
- e.g., int printf (char* fmt, ...)
- means number and type may vary across calls
- ellipses must come at the end
- to step through arguments
- declare variable of type va_list in function, e.g.,
va_list ap;
- initialize ap with va_start, passing the last named parameter, e.g.,
va_start (ap, fmt);
- then call va_arg with ap and a datatype
to retrieve a value of that type, e.g.,
va_arg (ap, int);
- call va_end with ap
after all arguments have been processed, but
before the function returns, e.g.,
va_end (ap);
- if variable types are used, use a switch
to control the particular call made to va_arg
- typically pass the
number of variable arguments as a parameter to the function
- the necessary macros and the va_list type are
declared in stdarg.h
- another use: operator
node in a parse tree (with a variable number of operands) (see
[CGLY])
- type va_list supports functions accepting a
variable number of arguments
- the macros are defined in stdarg.h
#include <stdarg.h>

void f(int nargs, ...) {
    /* the ... can only appear at the end of a parameter list */
    int i, tmp;
    va_list ap; /* argument pointer */
    va_start(ap, nargs); /* initializes ap to point to the first unnamed argument;
                            va_start must be called once before ap can be used */
    for (i=0; i < nargs; i++)
        tmp = va_arg(ap, int); /* returns one argument and steps ap to the next argument */
    /* the second argument to va_arg must be a type
       name so that va_arg knows how big a step to take */
    (void) tmp; /* the value is not used in this skeleton */
    va_end(ap); /* clean-up; must be called before function returns */
}
Hacks for precedence and associativity
in version 3
/* value stack will be an array of these YYSTYPE's;
has nothing to do with the union in calc.h */
%union {
int iValue; /* integer value */
char sIndex; /* symbol table index */
nodeType* nPtr; /* node pointer */
};
/* generates the following:
typedef union {
int iValue;
char sIndex;
nodeType* nPtr;
} YYSTYPE;
extern YYSTYPE yylval;
in other words, constants, variables, and nodes can
be represented by yylval in the parser's value stack
binds INTEGER to iValue in the YYSTYPE union
associates token names with correct component of the YYSTYPE union
to generate following code
yylval.nPtr = con(yyvsp[0].iValue); */
%token <iValue> INTEGER
%token <sIndex> VARIABLE
%token WHILE IF PRINT
%nonassoc IFX
%nonassoc ELSE
%left GE LE EQ NE '>' '<'
%left '+' '-'
%left '*' '/'
%right '^'
%nonassoc UMINUS
/* binds expr to nPtr in the YYSTYPE union */
%type <nPtr> stmt expr stmt_list
As a language grows in size and increases in complexity, this
approach will not scale. It is always preferable to disambiguate the
grammar.
Structures for parse tree nodes
Dependency graph for calculator (version 3)
Makefile for calculator
language (version 3)
SRC = calc
CC = gcc
LEX = flex
LEX_FLAGS =
YACC = bison
YACC_FLAGS = -d -t
all: interpreter compiler parsetree
interpreter: lex.yy.o $(SRC).tab.o interpreter.o
	$(CC) lex.yy.o $(SRC).tab.o interpreter.o -o interpreter -lm
compiler: lex.yy.o $(SRC).tab.o compiler.o
	$(CC) lex.yy.o $(SRC).tab.o compiler.o -o compiler
parsetree: lex.yy.o $(SRC).tab.o parsetree.o
	$(CC) lex.yy.o $(SRC).tab.o parsetree.o -o parsetree
lex.yy.o: lex.yy.c $(SRC).tab.h $(SRC).h
	$(CC) -c lex.yy.c
lex.yy.c: $(SRC).l
	$(LEX) $(LEX_FLAGS) $(SRC).l
$(SRC).tab.o: $(SRC).tab.c $(SRC).h
	$(CC) -c $(SRC).tab.c
$(SRC).tab.c: $(SRC).y
	$(YACC) $(YACC_FLAGS) $(SRC).y
$(SRC).tab.h: $(SRC).y
	$(YACC) $(YACC_FLAGS) $(SRC).y
interpreter.o: interpreter.c $(SRC).h $(SRC).tab.h
	$(CC) -c interpreter.c
compiler.o: compiler.c $(SRC).h $(SRC).tab.h
	$(CC) -c compiler.c
parsetree.o: parsetree.c $(SRC).h $(SRC).tab.h
	$(CC) -c parsetree.c
clean:
	-rm *.o $(SRC).tab.h $(SRC).tab.c lex.yy.c interpreter compiler parsetree
Questions to ponder
- why is oprNodeType the last field of the union?
will this approach work if it is not the last field?
- are there any other approaches we can take to laying out the memory for
these nodes of the parse tree?
how about a union of structs? what are the implications?
- moral of the story: since C is the lowest high-level language
(with little type checking), we can steer
the compiler into laying out memory in an advantageous way through
how we organize and overlap the memory structures in our program code
Things to do
- add a do { .. } while ( ... ); loop
- consolidate the con and id functions into one function
- consolidate the conNodeType and idNodeType
types into a single type (con_or_idNodeType) (call this
version 4)
- then remove the con_or_idNodeType entirely and simply
use an int in the nodeType struct (call this
version 5)
References
[CGLY] T. Niemann. A Compact Guide to Lex and Yacc. ePaperPress.
[CPL] B.W. Kernighan and D.M. Ritchie. The C Programming Language. Prentice Hall, Upper Saddle River, NJ, Second edition, 1988.
[LY] J.R. Levine, T. Mason, and D. Brown. Lex and Yacc. O'Reilly, Cambridge, MA, Second edition, 1995.
[UPE] B.W. Kernighan and R. Pike. The UNIX Programming Environment. Prentice Hall, Upper Saddle River, NJ, 1984.