YACC - Use and Operation

1. Introduction

This paper describes the use and operation of an LALR(1) parser generator, YACC (Yet Another Compiler-Compiler). YACC accepts as input a BNF-like grammar and, if possible, produces as output a set of tables for a table-driven shift-reduce parsing routine. The parsing routine and the tables together form a parser which recognizes the language defined by the grammar. The parser generated by YACC can be used as the core of a syntax analyzer by including in the grammar calls to user-provided action routines. These calls are made by the parser at the appropriate points in the analysis of the input string.

The class of LALR(1) grammars is a subclass of the class of LR(1) grammars, those which can be parsed by a deterministic bottom-up parser using one symbol of lookahead. The LALR(1) grammars are those LR(1) grammars for which a parser can be constructed by a relatively efficient process. Theoretically, every deterministic context-free language has an LR(1) grammar, although a particular LR(1) grammar for a language is not necessarily LALR(1). Practically, however, it has been observed that most common programming languages have "natural" grammars which are easily converted to be LALR(1).

The original YACC was designed and implemented on a PDP-11/45 and a Honeywell 6000 by S. C. Johnson at Bell Laboratories. The version described in this paper was implemented on the PDP-10 by Alan Snyder.

2. Using YACC

In the simplest case, the input to YACC is a file containing a BNF-like grammar for the language. The grammar consists of a sequence of rules, which have the following syntax:

    rule: lhs ':' rhs_list
    lhs: symbol
    rhs_list: rhs | rhs_list '|' rhs
    rhs: symbol_sequence
    symbol_sequence: symbol | symbol_sequence symbol

The above rules for rules are examples of rules.
Another example is the following simple grammar for expressions:

    e: e '+' t | e '-' t | t
    t: t '*' p | t '/' p | p
    p: idn | '(' e ')'

A symbol is any sequence of alphanumeric characters, including underlines, dollar signs, and periods. In addition, a symbol may be any sequence of characters enclosed in single quotes. The symbols which appear as the left-hand-sides of rules are the non-terminal symbols; all other symbols appearing in the grammar are assumed to be terminal symbols. The symbol appearing as the left-hand-side of the first rule is considered to be the start symbol of the grammar.

After a file containing the grammar has been prepared, YACC may be run. YACC will respond by asking for the name of the file containing the grammar. After the file name is entered, YACC will analyze the grammar and construct the parsing tables. YACC will print some messages on the terminal to indicate its progress. When it has finished, a listing will have been placed on the file YACC OUTPUT and the parsing tables will have been written onto the file YACC TABLES.

In the process of constructing a parser for the grammar, YACC may discover conflicts in the grammar. These conflicts indicate that the grammar is not LALR(1). The conflicts, which are listed in the OUTPUT file, may be of two types. The first type of conflict is a shift/reduce conflict, abbreviated S/R. A shift/reduce conflict indicates that, in the given state and with the given input symbol, the constructed parser could legitimately either shift the input symbol onto the stack or make an immediate reduction. Shift/reduce conflicts are resolved by YACC in favor of shifting. The second type of conflict is a reduce/reduce conflict, abbreviated R/R. A reduce/reduce conflict indicates that, in the given state and with the given input symbol, the parser could legitimately make either of two reductions. Reduce/reduce conflicts are resolved by YACC in favor of the production appearing earlier in the input file.
The relation of a conflict to a problem in the grammar can be determined by examining the description of the particular state in the action table section of the OUTPUT file. The first part of the description is a set of items, where an item is a rule which contains a marker ('.') in the right-hand-side. The marker indicates how much of the right-hand-side has been seen by the parser when the parser is in that state. Thus, the collection of items represents the set of possibilities being considered by the parser when in that state. A conflict indicates that the parser cannot discard one of two possibilities on the basis of the current input symbol, yet any action it takes will have the effect of eliminating one of the two possibilities.

3. Interfacing with a Lexical Analyzer

The parsing tables produced by YACC are in the form of a C program, ready to be compiled by the C compiler (CC). This C program may be loaded together with the compiled version of a parsing routine in order to construct a working parser for the language. A standard parsing routine, called PARSE, may be found in the file "YPARSE.C".

PARSE assumes the existence of a lexical routine, called GETTOK, which it can call in order to obtain the next terminal symbol from the input stream. GETTOK is expected to set the values of three integer global variables, LEXTYPE, LEXINDEX, and LEXLINE.

LEXTYPE should be set to an integer which distinguishes which terminal symbol has been read. The correspondence between integers and terminal symbols is listed in the OUTPUT file produced by YACC. However, it is more convenient when an actual parser is to be constructed to specify in the grammar the correspondence between integers and terminal symbols. This is done by listing at the beginning of the file the terminal symbols of the grammar. They will be numbered consecutively, starting with 3.
(The integer 1 is to be returned by the lexical routine to indicate the end of the input stream; the integer 2 is reserved for an error recovery method.) The listing of terminal symbols in the grammar should be separated from the list of rules by the symbol '\\'. For example, the grammar

    '+' '-' '*' '/' '(' ')' idn
    \\
    e: e '+' t | e '-' t | t
    t: t '*' p | t '/' p | p
    p: idn | '(' e ')'

defines the following representations of terminal symbols:

    eof   1
    +     3
    -     4
    *     5
    /     6
    (     7
    )     8
    idn   9

The variable LEXLINE should be set to the line number in the input file on which the terminal symbol being returned appeared; this value is used by PARSE when reporting syntax errors and is made available to any action routines. The variable LEXINDEX is used only when performing translations (see next section).

In addition, PARSE requires a routine PTOKEN which will print some symbolic representation of a token; this routine is used when reporting syntax errors.

4. Performing Translations

As described so far, the parser performs only recognition; that is, given an input string of terminal symbols, it will produce error messages if the string is not in the language defined by the grammar and do nothing otherwise. YACC is capable also of producing tables for a parser which performs translations, for example, the syntax analyzer of a compiler.

The following extension is made in order to support translation: the parser associates with each terminal symbol (received from the lexical routine) and each nonterminal symbol (resulting from a reduction) a word (integer, pointer) called a translation element. The translation element for a terminal symbol is produced by the lexical routine; it is communicated to PARSE via the global variable LEXINDEX. Typically, the translation element for a terminal symbol is used to distinguish between different identifiers and constants.
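
A lexical routine for the example grammar of section 3 might look roughly like the following C sketch. It is only an illustration of the interface described above, not the routine distributed with YACC. The function signature, the lowercase spellings gettok, lextype, lexindex, and lexline, the input pointer lex_input, and the identifier table names are all assumptions made for the example; the token numbers are the ones listed in the table of representations.

```c
#include <ctype.h>
#include <string.h>

/* Token numbers as assigned by the terminal list in the example grammar. */
enum { T_EOF = 1, T_PLUS = 3, T_MINUS, T_STAR, T_SLASH, T_LPAR, T_RPAR, T_IDN };

int lextype, lexindex, lexline = 1;    /* globals read by the parsing routine */

const char *lex_input = "";            /* hypothetical: text still to be scanned */

#define MAXNAMES 64
#define MAXLEN   32
static char names[MAXNAMES][MAXLEN];   /* hypothetical identifier table */
static int nnames;

/* Return the table index of an identifier, entering it if it is new.
   Returns -1 if the table is full. */
static int lookup(const char *name, size_t len)
{
    int i;

    if (len >= MAXLEN) len = MAXLEN - 1;          /* truncate long names */
    for (i = 0; i < nnames; i++)
        if (strlen(names[i]) == len && memcmp(names[i], name, len) == 0)
            return i;
    if (nnames == MAXNAMES) return -1;
    memcpy(names[nnames], name, len);
    names[nnames][len] = '\0';
    return nnames++;
}

/* Read the next terminal symbol and set lextype, lexindex, and lexline. */
void gettok(void)
{
    static const char ops[] = "+-*/()";  /* in token-number order from T_PLUS */

    for (;;) {
        unsigned char c = (unsigned char)*lex_input;
        const char *op;

        if (c == '\0') {                 /* end of the input stream */
            lextype = T_EOF;
            lexindex = 0;
            return;
        }
        if (c == '\n') lexline++;
        if (isalpha(c) || c == '_') {    /* identifier: idn */
            const char *start = lex_input;
            while (isalnum((unsigned char)*lex_input) || *lex_input == '_')
                lex_input++;
            lextype = T_IDN;
            lexindex = lookup(start, (size_t)(lex_input - start));
            return;
        }
        lex_input++;
        if ((op = strchr(ops, c)) != NULL) {      /* operator or parenthesis */
            lextype = T_PLUS + (int)(op - ops);
            lexindex = 0;
            return;
        }
        /* white space or an unrecognized character: skip it (a simplification) */
    }
}
```

Here the translation element of an identifier is its index in the name table, so two occurrences of the same identifier yield the same value of lexindex.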
The translation element for a nonterminal symbol is obtained by calling a user-provided action routine when a reduction is made which produces the nonterminal symbol. This action routine is specified by following the production rule in the grammar with the body of the routine, enclosed in braces. The action routine may access the translation elements associated with the symbols on the right-hand-side of the production using the notation #n, where "n" is the number of the symbol (i.e., #1 refers to the translation element for the first symbol of the right-hand-side). The action routine specifies the value for the left-hand-side by setting the global variable VAL. A typical action routine in a parser which produces tree representations is

    {val=node(node_type,#1,#2,#3);}

where node is a routine which constructs nodes of the tree and node_type is a tag which indicates the type of the node.

An action routine may also specify a line-number to be associated with the left-hand-side by setting the global variable LINE; the line-numbers of the symbols on the right-hand-side are accessible through the global variable PL (i.e., pl[3] refers to the line-number of the third symbol on the right-hand-side).

5. Disambiguation

YACC is capable of disambiguating ambiguous grammars through the use of precedence and associativity information. This is especially useful in the case of arithmetic expressions since it allows a much simpler grammar to be used. For example, the grammar for expressions given above could be written:

    '+' '-' '*' '/' '(' ')' idn
    \< '+' '-'
    \< '*' '/'
    \\
    e: e '+' e | e '-' e | e '*' e | e '/' e | idn | '(' e ')'

The two lines following the list of terminal symbols create two levels of precedence in increasing order and assign those levels to the terminal symbols appearing on those lines. The '\<' which begins a new precedence level also indicates left-association.
One may also specify '\>' for right-association and '\2' indicating that association is not permitted (is to be regarded as a syntax error). This last feature may be used to prohibit the misleading association of operators such as comparison operators.

6. The Operation of YACC

The operation of YACC is performed in five steps. First, the input file is read and an internal representation of the grammar is created. Second, certain auxiliary data structures are constructed which contain information about the grammar which is used by later steps. Third, the canonical LR(0) parser for the grammar is constructed. Fourth, the LR(0) parser is analyzed by computing and applying lookahead in order to resolve conflicts in the LR(0) parser. Finally, a listing is written onto the OUTPUT file containing the remaining conflicts in the parser, the grammar, and the parser itself, and the tables are written onto the TABLES file.

6.1 Constructing the Canonical LR(0) Parser

The canonical LR(0) parser for the grammar is constructed by the following method: First, the grammar is augmented by adding a production

    $accept: S -|

where the symbol $accept is a distinguished nonterminal added by YACC, S represents the starting symbol of the original grammar, and -| represents the end-of-file symbol. Second, the initial state of the parser is created containing the item

    $accept -> . S -|

and its closure. The closure of a set of items I is defined to be the smallest set of items C containing I such that if C contains an item of the form

    A -> a . B b

for some nonterminal B and strings a and b, then C contains all items of the form

    B -> . w

for string w.

The final step in constructing the canonical LR(0) parser consists of constructing the set of states accessible from the initial state. The set of accessible states is defined to be the smallest set of states containing the initial state such that for each state i in the set, if j is the successor state of i on some symbol x, then j is also in the set.
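
The closure operation defined above can be sketched in C as follows. This is an illustrative sketch rather than YACC's actual implementation: the item-set representation (a boolean matrix indexed by production number and marker position), the symbol numbering, and the names grammar, itemset, and closure are all assumptions made for the example. The grammar encoded is the expression grammar of section 2, augmented with the $accept production.

```c
#include <stdbool.h>

/* Hypothetical symbol numbering: nonterminals first, then terminals. */
enum { ACCEPT, E, T, P, NNONTERM,
       PLUS = NNONTERM, MINUS, STAR, SLASH, LPAR, RPAR, IDN, END };

#define NPROD  9
#define MAXRHS 3

struct prod { int lhs, len, rhs[MAXRHS]; };

static const struct prod grammar[NPROD] = {
    { ACCEPT, 2, { E, END } },        /* 0: $accept: e -|   */
    { E, 3, { E, PLUS, T } },         /* 1: e: e '+' t      */
    { E, 3, { E, MINUS, T } },        /* 2: e: e '-' t      */
    { E, 1, { T } },                  /* 3: e: t            */
    { T, 3, { T, STAR, P } },         /* 4: t: t '*' p      */
    { T, 3, { T, SLASH, P } },        /* 5: t: t '/' p      */
    { T, 1, { P } },                  /* 6: t: p            */
    { P, 1, { IDN } },                /* 7: p: idn          */
    { P, 3, { LPAR, E, RPAR } },      /* 8: p: '(' e ')'    */
};

/* items[p][d] is true when the set contains production p with the
   marker placed before position d of its right-hand-side. */
typedef bool itemset[NPROD][MAXRHS + 1];

/* Extend the set in place until it is closed. */
void closure(itemset items)
{
    bool changed = true;

    while (changed) {
        changed = false;
        for (int p = 0; p < NPROD; p++) {
            for (int d = 0; d < grammar[p].len; d++) {
                if (!items[p][d]) continue;
                int b = grammar[p].rhs[d];       /* symbol after the marker */
                if (b >= NNONTERM) continue;     /* a terminal: nothing to add */
                for (int q = 0; q < NPROD; q++) {
                    if (grammar[q].lhs == b && !items[q][0]) {
                        items[q][0] = true;      /* add B -> . w */
                        changed = true;
                    }
                }
            }
        }
    }
}
```

Starting from the single item $accept -> . e -|, the closure contains one item for each of the nine productions, each with the marker at the far left.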
The successor state j of a state i on a symbol x is constructed in two steps: First, for each item in state i of the form

    A -> a . x b

for nonterminal A and strings a and b, the item

    A -> a x . b

is added to state j. Second, the closure of the set of items in state j is added to state j.

6.2 Applying Lookahead to the LR(0) Parser

The constructed LR(0) parser will generally contain conflicts, that is, states in which more than one action is valid for some input symbol. An item of the form

    A -> a .

is called a reduce item (reduction) since it indicates that the entire right-hand-side of a rule has been recognized and can be reduced to the left-hand-side. An item of the form

    A -> a . x b

where x is a terminal symbol, is called a shift item since it indicates that if x is the current input symbol, then it should be shifted onto the stack and control passed to the x-successor state, which will contain the item

    A -> a x . b

If a state in the LR(0) parser contains a reduce item and one or more shift items, or more than one reduce item, then the state contains a conflict. Such conflicts may be resolved if it can be determined that the reductions are valid only for certain input symbols. In any state, if the sets of valid input symbols ("lookahead sets") for each reduction and the set of terminal symbols for which successor states exist are disjoint, then there is no conflict in that state, since the parser can determine by looking at the current input symbol whether to shift or to reduce, and what reduction to make.

In YACC, the lookahead sets are computed one terminal symbol at a time; that is, for each terminal symbol, it is determined which reductions are applicable (contain that terminal symbol in their lookahead set). Then, each state of the LR(0) parser is checked for conflicts on that terminal symbol. If there is more than one applicable reduction, then a reduce/reduce conflict is announced.
If there is a successor state on that terminal symbol and one or more applicable reductions, then a shift/reduce conflict is announced.
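
The per-terminal conflict check just described can be sketched in C as follows. The data layout is hypothetical and is not taken from YACC's source: lookahead sets and the set of terminals with successor states are held as bit masks, terminal numbers are assumed to be below 32, and the names state, reduction, and check_terminal are invented for the example.

```c
#define MAXRED 8

/* Hypothetical layout: a set of terminals is a bit mask, so terminal
   numbers are assumed to be below 32. */
struct reduction {
    int prod;                  /* production number, in input-file order */
    unsigned long lookahead;   /* bit t set: terminal t is in the lookahead set */
};

struct state {
    unsigned long shifts;      /* bit t set: a successor state exists on t */
    int nred;                  /* number of reduce items in the state */
    struct reduction red[MAXRED];
};

enum { NO_CONFLICT = 0, SR_CONFLICT = 1, RR_CONFLICT = 2 };

/* Check one state for conflicts on terminal t.  The result is a
   combination of SR_CONFLICT and RR_CONFLICT, or NO_CONFLICT. */
int check_terminal(const struct state *s, int t)
{
    unsigned long bit = 1UL << t;
    int applicable = 0;        /* reductions whose lookahead contains t */
    int result = NO_CONFLICT;

    for (int i = 0; i < s->nred; i++)
        if (s->red[i].lookahead & bit)
            applicable++;

    if (applicable > 1)
        result |= RR_CONFLICT;
    if ((s->shifts & bit) && applicable >= 1)
        result |= SR_CONFLICT;
    return result;
}
```

A driver would call check_terminal for every state and every terminal symbol and list each nonzero result in the OUTPUT file. Applying the resolution rules of section 2 (prefer the shift, or else the earliest production) would be a separate step.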