diff --git a/bin/c/ts.yacc b/bin/c/ts.yacc
new file mode 100644
index 00000000..5dffc0ca
Binary files /dev/null and b/bin/c/ts.yacc differ
diff --git a/doc/c/oyacc.hlp b/doc/c/oyacc.hlp
new file mode 100644
index 00000000..8e6b3df0
--- /dev/null
+++ b/doc/c/oyacc.hlp
@@ -0,0 +1,341 @@
+                        YACC - Use and Operation
+                                  - 2 -
+
+
+1. Introduction
+
+This paper describes the use and operation of a LALR(1) parser
+generator YACC (Yet Another Compiler-Compiler). YACC accepts as
+input a BNF-like grammar and, if possible, produces as output a
+set of tables for a table-driven shift-reduce parsing routine.
+The parsing routine and the tables together form a parser which
+recognizes the language defined by the grammar. The parser
+generated by YACC can be used as the core of a syntax analyzer by
+including in the grammar calls to user-provided action routines.
+These calls are made by the parser at the appropriate points in
+the analysis of the input string.
+
+The class of LALR(1) grammars is a subclass of the class of LR(1)
+grammars, those which can be parsed by a deterministic bottom-up
+parser using one symbol of lookahead. The LALR(1) grammars are
+those LR(1) grammars for which a parser can be constructed by a
+relatively efficient process. Theoretically, all deterministic
+context-free languages have a LR(1) grammar, but not necessarily
+a LALR(1) grammar. Practically, however, it has been observed
+that most common programming languages have "natural" grammars
+which are easily converted to be LALR(1).
+
+The original YACC was designed and implemented on a PDP-11/45 and
+a Honeywell 6000 by S. C. Johnson at Bell Laboratories. The
+version described in this paper was implemented on the PDP-10 by
+Alan Snyder.
+                                  - 3 -
+
+
+2. Using YACC
+
+In the simplest case, the input to YACC is a file containing a
+BNF-like grammar for the language. The grammar consists of a
+sequence of rules, which have the following syntax:
+
+        rule: lhs ':' rhs_list
+        lhs: symbol
+        rhs_list: rhs | rhs_list '|' rhs
+        rhs: symbol_sequence
+        symbol_sequence: symbol | symbol_sequence symbol
+
+The above rules for rules are examples of rules. Another example
+is the following simple grammar for expressions:
+
+        e: e '+' t | e '-' t | t
+        t: t '*' p | t '/' p | p
+        p: idn | '(' e ')'
+
+A symbol is any sequence of alphanumeric characters, including
+underlines, dollar signs, and periods. In addition, a symbol may
+be any sequence of characters enclosed in single quotes.
+
+The symbols which appear as the left-hand-sides of rules are the
+non-terminal symbols; all other symbols appearing in the grammar
+are assumed to be terminal symbols. The symbol appearing as the
+left-hand-side of the first rule is considered to be the start
+symbol of the grammar.
+
+After a file containing the grammar has been prepared, YACC may be
+run. YACC will respond by asking for the name of the file containing
+the grammar. After the file name is entered, YACC will analyze
+the grammar and construct the parsing tables. YACC will print some
+messages on the terminal to indicate its progress. When it has
+finished, a listing will have been placed on the file YACC OUTPUT and
+the parsing tables will have been written onto the file YACC TABLES.
+
+In the process of constructing a parser for the grammar, YACC
+may discover conflicts in the grammar. These conflicts indicate
+that the grammar is not LALR(1). The conflicts, which are listed
+in the OUTPUT file, may be of two types. The first type of
+conflict is a shift/reduce conflict, abbreviated S/R. A
+shift/reduce conflict indicates that, in the given state and with
+the given input symbol, the constructed parser could legitimately
+either shift the input symbol onto the stack or make an immediate
+reduction. Shift/reduce conflicts are resolved by YACC in favor
+of shifting. The second type of conflict is a reduce/reduce
+conflict, abbreviated R/R. A reduce/reduce conflict indicates
+that, in the given state and with the given input symbol, the
+parser could legitimately make either of two reductions.
+Reduce/reduce conflicts are resolved by YACC in favor of the
+production appearing earlier in the input file.
+
+The relation of a conflict to a problem in the grammar can be
+                                  - 4 -
+
+
+determined by examining the description of the particular state
+in the action table section of the OUTPUT file. The first part
+of the description is a set of items, where an item is a rule
+which contains a marker ('.') in the right-hand-side. The marker
+indicates how much of the right-hand-side has been seen by the
+parser when the parser is in that state. Thus, the collection of
+items represents the set of possibilities being considered by the
+parser when in that state. A conflict indicates that the parser
+cannot discard one of two possibilities on the basis of the
+current input symbol, yet any action it takes will have the
+effect of eliminating one of the two possibilities.
+                                  - 5 -
+
+
+3. Interfacing with a Lexical Analyzer
+
+The parsing tables produced by YACC are in the form of a C
+program, ready to be compiled by the C compiler (CC). This C
+program may be loaded together with the compiled version of a
+parsing routine in order to construct a working parser for the
+language. A standard parsing routine, called PARSE, may be found
+in the file "YPARSE.C".
+
+PARSE assumes the existence of a lexical routine, called GETTOK,
+which it can call in order to obtain the next terminal symbol
+from the input stream. GETTOK is expected to set the values of
+three integer global variables, LEXTYPE, LEXINDEX, and LEXLINE.
+LEXTYPE should be set to an integer which distinguishes which
+terminal symbol has been read. The correspondence between
+integers and terminal symbols is listed in the OUTPUT file
+produced by YACC. However, it is more convenient when an actual
+parser is to be constructed to specify in the grammar the
+correspondence between integers and terminal symbols. This is
+done by listing at the beginning of the file the terminal symbols
+of the grammar. They will be numbered consecutively, starting
+with 3. (The integer 1 is to be returned by the lexical routine
+to indicate the end of the input stream; the integer 2 is
+reserved for an error recovery method.) The listing of terminal
+symbols in the grammar should be separated from the list of rules
+by the symbol '\\'. For example, the grammar
+
+        '+' '-' '*' '/' '(' ')' idn
+
+        \\
+
+        e: e '+' t | e '-' t | t
+        t: t '*' p | t '/' p | p
+        p: idn | '(' e ')'
+
+defines the following representations of terminal symbols:
+
+        eof     1
+        +       3
+        -       4
+        *       5
+        /       6
+        (       7
+        )       8
+        idn     9
+
+The variable LEXLINE should be set to the line number in the
+input file on which the terminal symbol being returned appeared;
+this value is used by PARSE when reporting syntax errors and is
+made available to any action routines. The variable LEXINDEX is
+used only when performing translations (see next section).
+
+In addition, PARSE requires a routine PTOKEN which will print
+some symbolic representation of a token; this routine is used
+when reporting syntax errors.
+                                  - 6 -
+
+
+4. Performing Translations
+
+As described so far, the parser performs only recognition; that
+is, given an input string of terminal symbols, it will produce
+error messages if the string is not in the language defined by
+the grammar and do nothing otherwise. YACC is capable also of
+producing tables for a parser which performs translations, for
+example, the syntax analyzer of a compiler. The following
+extension is made in order to support translation: the parser
+associates with each terminal symbol (received from the lexical
+routine) and each nonterminal symbol (resulting from a reduction)
+a word (integer, pointer) called a translation element. The
+translation element for a terminal symbol is produced by the
+lexical routine; it is communicated to PARSE via the global
+variable LEXINDEX. Typically, the translation element for a
+terminal symbol is used to distinguish between different
+identifiers and constants. The translation element for a
+nonterminal symbol is obtained by calling a user-provided action
+routine when a reduction is made which produces the nonterminal
+symbol. This action routine is specified by following the
+production rule in the grammar with the body of the routine,
+enclosed in braces. The action routine may access the
+translation elements associated with the symbols on the right-
+hand-side of the production using the notation #n, where "n" is
+the number of the symbol (i.e., #1 refers to the translation
+element for the first symbol of the right-hand-side). The action
+routine specifies the value for the left-hand-side by setting the
+global variable VAL. A typical action routine in a parser which
+produces tree representations is
+
+        {val=node(node_type,#1,#2,#3);}
+
+where node is a routine which constructs nodes of the tree and
+node_type is a tag which indicates the type of the node. An
+action routine may also specify a line-number to be associated
+with the left-hand-side by setting the global variable LINE; the
+line-numbers of the symbols on the right-hand-side are accessible
+through the global variable PL (i.e., pl[3] refers to the line-
+number of the third symbol on the right-hand-side).
+                                  - 7 -
+
+
+5. Disambiguation
+
+YACC is capable of disambiguating ambiguous grammars through the
+use of precedence and associativity information. This is
+especially useful in the case of arithmetic expressions since it
+allows a much simpler grammar to be used. For example, the
+grammar for expressions given above could be written:
+
+        '+' '-' '*' '/' '(' ')' idn
+
+        \< '+' '-'
+        \< '*' '/'
+
+        \\
+
+        e:      e '+' e
+        |       e '-' e
+        |       e '*' e
+        |       e '/' e
+        |       idn
+        |       '(' e ')'
+
+The two lines following the list of terminal symbols create two
+levels of precedence in increasing order and assign those levels
+to the terminal symbols appearing on those lines. The '\<' which
+begins a new precedence level also indicates left-association.
+One may also specify '\>' for right-association and '\2'
+indicating that association is not permitted (is to be regarded
+as a syntax error). This last feature may be used to prohibit
+the misleading association of operators such as comparison
+operators.
+                                  - 8 -
+
+
+6. The Operation of YACC
+
+The operation of YACC is performed in five steps. First, the
+input file is read and an internal representation of the grammar
+is created. Second, certain auxiliary data structures are
+constructed which contain information about the grammar which is
+used by later steps. Third, the canonical LR(0) parser for the
+grammar is constructed. Fourth, the LR(0) parser is analyzed by
+computing and applying lookahead in order to resolve conflicts in
+the LR(0) parser. Finally, a listing is written onto the OUTPUT
+file containing the remaining conflicts in the parser, the
+grammar, and the parser itself, and the tables are written onto
+the TABLES file.
+
+6.1 Constructing the Canonical LR(0) Parser
+
+The canonical LR(0) parser for the grammar is constructed by the
+following method: First, the grammar is augmented by adding a
+production
+
+        $accept: S -|
+
+where the symbol $accept is a distinguished nonterminal added by
+YACC, S represents the starting symbol of the original grammar,
+and -| represents the end-of-file symbol. Second, the initial
+state of the parser is created containing the item
+
+        $accept -> . S -|
+
+and its closure. The closure of a set of items I is defined to
+be the smallest set of items C containing I such that if C
+contains an item of the form
+
+        A -> a . B b
+
+for some nonterminal B and strings a and b, then C contains all
+items of the form
+
+        B -> . w
+
+for string w. The final step in constructing the canonical LR(0)
+parser consists of constructing the set of states accessible from
+the initial state. The set of accessible states is defined to be
+the smallest set of states S containing the initial state such that
+for each state i in S, if j is the successor state of i on some
+symbol x, then j is in S. The successor state j of a state i on
+a symbol x is constructed in two steps: First, for each item in
+state i of the form
+
+        A -> a . x b
+
+for nonterminal A and strings a and b, the item
+
+        A -> a x . b
+
+                                  - 9 -
+
+
+is added to state j. Second, the closure of the set of items in
+state j is added to state j.
+
+6.2 Applying Lookahead to the LR(0) Parser
+
+The constructed LR(0) parser will generally contain conflicts,
+that is, states in which more than one action is valid for some
+input symbol. An item of the form
+
+        A -> a .
+
+is called a reduce item (reduction) since it indicates that the
+entire right-hand-side of a rule has been recognized and can be
+reduced to the left-hand-side. An item of the form
+
+        A -> a . x b
+
+where x is a terminal symbol, is called a shift item since it
+indicates that if x is the current input symbol, then it should
+be shifted onto the stack and control passed to the x-successor
+state, which will contain the item
+
+        A -> a x . b
+
+If a state in the LR(0) parser contains a reduce item and one or
+more shift items, or more than one reduce item, then the state
+contains a conflict. Such conflicts may be resolved if it can be
+determined that the reductions are valid only for certain input
+symbols. In any state, if the sets of valid input symbols
+("lookahead sets") for each reduction and the set of terminal
+symbols for which successor states exist are disjoint, then there
+is no conflict in that state, since the parser can determine by
+looking at the current input symbol whether to shift or to
+reduce, and what reduction to make.
+
+In YACC, the lookahead sets are computed one terminal symbol at a
+time; that is, for each terminal symbol, it is determined which
+reductions are applicable (contain that terminal symbol in their
+lookahead set). Then, each state of the LR(0) parser is checked
+for conflicts on that terminal symbol. If there is more than
+one applicable reduction, then a reduce/reduce conflict is
+announced. If there is a successor state on that terminal symbol
+and one or more applicable reductions, then a shift/reduce
+conflict is announced.
diff --git a/doc/programs.md b/doc/programs.md
index 900a8ac5..af59e2f7 100644
--- a/doc/programs.md
+++ b/doc/programs.md
@@ -341,6 +341,7 @@
 - XGP, PDP-11 controller for the Xerox Graphics Printer.
 - XGPSPL, spooler for the Xerox Graphics Printer.
 - XXFILE, feed scripted input to a STY session.
+- YACC, parser generator.
 - YAHTZE, the game of Yahtzee.
 - YOW, print Zippyisms.
 - ZAP, dump TV bitmap as an XGP scan file.