mirror of
https://github.com/PDP-10/its.git
synced 2026-04-25 11:51:38 +00:00
YACC - parser generator.
Binary file originally from ES; TS YACC, timestamped 1978-05-21. The help file is from a TOPS-20 machine; timestamp 1981-08-25.
BIN
bin/c/ts.yacc
Normal file
Binary file not shown.
341
doc/c/oyacc.hlp
Normal file
@@ -0,0 +1,341 @@
YACC - Use and Operation


1. Introduction

This paper describes the use and operation of YACC (Yet Another
Compiler-Compiler), an LALR(1) parser generator. YACC accepts as
input a BNF-like grammar and, if possible, produces as output a
set of tables for a table-driven shift-reduce parsing routine.
The parsing routine and the tables together form a parser which
recognizes the language defined by the grammar. The parser
generated by YACC can be used as the core of a syntax analyzer by
including in the grammar calls to user-provided action routines.
These calls are made by the parser at the appropriate points in
the analysis of the input string.

The class of LALR(1) grammars is a subclass of the class of LR(1)
grammars, those which can be parsed by a deterministic bottom-up
parser using one symbol of lookahead. The LALR(1) grammars are
those LR(1) grammars for which a parser can be constructed by a
relatively efficient process. Theoretically, every deterministic
context-free language has an LR(1) grammar, but not necessarily
an LALR(1) grammar. Practically, however, it has been observed
that most common programming languages have "natural" grammars
which are easily converted to be LALR(1).

The original YACC was designed and implemented on a PDP-11/45 and
a Honeywell 6000 by S. C. Johnson at Bell Laboratories. The
version described in this paper was implemented on the PDP-10 by
Alan Snyder.


2. Using YACC

In the simplest case, the input to YACC is a file containing a
BNF-like grammar for the language. The grammar consists of a
sequence of rules, which have the following syntax:

        rule: lhs ':' rhs_list
        lhs: symbol
        rhs_list: rhs | rhs_list '|' rhs
        rhs: symbol_sequence
        symbol_sequence: symbol | symbol_sequence symbol

The rules above, which describe the syntax of rules, are
themselves examples of rules. Another example is the following
simple grammar for expressions:

        e: e '+' t | e '-' t | t
        t: t '*' p | t '/' p | p
        p: idn | '(' e ')'

A symbol is any sequence of alphanumeric characters, including
underlines, dollar signs, and periods. In addition, a symbol may
be any sequence of characters enclosed in single quotes.

The symbols which appear as the left-hand-sides of rules are the
non-terminal symbols; all other symbols appearing in the grammar
are assumed to be terminal symbols. The symbol appearing as the
left-hand-side of the first rule is considered to be the start
symbol of the grammar.

After a file containing the grammar has been prepared, YACC may be
run. YACC will respond by asking for the name of the file containing
the grammar. After the file name is entered, YACC will analyze
the grammar and construct the parsing tables. YACC will print some
messages on the terminal to indicate its progress. When it has
finished, a listing will have been placed on the file YACC OUTPUT and
the parsing tables will have been written onto the file YACC TABLES.

In the process of constructing a parser for the grammar, YACC
may discover conflicts in the grammar. These conflicts indicate
that the grammar is not LALR(1). The conflicts, which are listed
in the OUTPUT file, may be of two types. The first type of
conflict is a shift/reduce conflict, abbreviated S/R. A
shift/reduce conflict indicates that, in the given state and with
the given input symbol, the constructed parser could legitimately
either shift the input symbol onto the stack or make an immediate
reduction. Shift/reduce conflicts are resolved by YACC in favor
of shifting. The second type of conflict is a reduce/reduce
conflict, abbreviated R/R. A reduce/reduce conflict indicates
that, in the given state and with the given input symbol, the
parser could legitimately make either of two reductions.
Reduce/reduce conflicts are resolved by YACC in favor of the
production appearing earlier in the input file.

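The two default resolutions just described can be sketched in C as
follows. This is only an illustration: the action encoding and the
names action_kind, action, and resolve_conflict are hypothetical and
are not taken from YACC's own tables.

```c
/* Hypothetical action encoding, for illustration only. */
enum action_kind { ACT_SHIFT, ACT_REDUCE };

struct action {
    enum action_kind kind;
    int rule;    /* ACT_REDUCE: rule number in input-file order */
    int state;   /* ACT_SHIFT: successor state */
};

/* Choose between two conflicting actions using the defaults
   described in the text: a shift is preferred to a reduction, and
   of two reductions the rule appearing earlier in the file wins. */
struct action resolve_conflict(struct action a, struct action b)
{
    if (a.kind == ACT_SHIFT)
        return a;                       /* S/R: prefer the shift */
    if (b.kind == ACT_SHIFT)
        return b;
    return (a.rule <= b.rule) ? a : b;  /* R/R: earlier rule */
}
```

Numbering the rules in the order in which they appear in the input
file makes "earlier" a simple integer comparison.
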
The relation of a conflict to a problem in the grammar can be
determined by examining the description of the particular state
in the action table section of the OUTPUT file. The first part
of the description is a set of items, where an item is a rule
which contains a marker ('.') in the right-hand-side. The marker
indicates how much of the right-hand-side has been seen by the
parser when the parser is in that state. Thus, the collection of
items represents the set of possibilities being considered by the
parser when in that state. A conflict indicates that the parser
cannot discard one of two possibilities on the basis of the
current input symbol, yet any action it takes will have the
effect of eliminating one of the two possibilities.


3. Interfacing with a Lexical Analyzer

The parsing tables produced by YACC are in the form of a C
program, ready to be compiled by the C compiler (CC). This C
program may be loaded together with the compiled version of a
parsing routine in order to construct a working parser for the
language. A standard parsing routine, called PARSE, may be found
in the file "<C>YPARSE.C".

PARSE assumes the existence of a lexical routine, called GETTOK,
which it can call in order to obtain the next terminal symbol
from the input stream. GETTOK is expected to set the values of
three integer global variables, LEXTYPE, LEXINDEX, and LEXLINE.
LEXTYPE should be set to an integer which identifies the
terminal symbol that has been read. The correspondence between
integers and terminal symbols is listed in the OUTPUT file
produced by YACC. However, when an actual parser is to be
constructed, it is more convenient to specify the correspondence
between integers and terminal symbols in the grammar itself. This
is done by listing the terminal symbols of the grammar at the
beginning of the file. They will be numbered consecutively,
starting with 3. (The integer 1 is to be returned by the lexical
routine to indicate the end of the input stream; the integer 2 is
reserved for an error recovery method.) The listing of terminal
symbols in the grammar should be separated from the list of rules
by the symbol '\\'. For example, the grammar

        '+' '-' '*' '/' '(' ')' idn

        \\

        e: e '+' t | e '-' t | t
        t: t '*' p | t '/' p | p
        p: idn | '(' e ')'

defines the following representations of terminal symbols:

        eof     1
        +       3
        -       4
        *       5
        /       6
        (       7
        )       8
        idn     9

The variable LEXLINE should be set to the line number in the
input file on which the terminal symbol being returned appeared;
this value is used by PARSE when reporting syntax errors and is
made available to any action routines. The variable LEXINDEX is
used only when performing translations (see next section).

In addition, PARSE requires a routine PTOKEN which will print
some symbolic representation of a token; this routine is used
when reporting syntax errors.

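A lexical routine for the expression grammar above might look
roughly like the following sketch. Several things here are
assumptions rather than facts about YPARSE.C: the lower-case
spelling of the names, the argument list of ptoken, the helper
lex_start and the identifier table (both hypothetical), reading
from a string instead of a file, and the use of code 0 for an
unrecognized character. Only the token numbers and the three
global variables come from the text.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Token codes from the table above; T_BAD is hypothetical. */
enum { T_BAD = 0, T_EOF = 1, T_PLUS = 3, T_MINUS, T_STAR,
       T_SLASH, T_LPAREN, T_RPAREN, T_IDN };

int lextype, lexindex, lexline = 1;  /* set on each call to gettok */

static const char *lex_input = "";   /* hypothetical input source */
static char names[64][32];           /* hypothetical identifier table */
static int nnames;

void lex_start(const char *text)
{
    lex_input = text;
    lexline = 1;
    nnames = 0;
}

/* Return the table index of an identifier, adding it if new. */
static int intern(const char *s, size_t n)
{
    int i;
    if (n > 31)
        n = 31;                      /* truncate long names */
    for (i = 0; i < nnames; i++)
        if (strlen(names[i]) == n && memcmp(names[i], s, n) == 0)
            return i;
    if (nnames == 64)
        return -1;                   /* table full */
    memcpy(names[nnames], s, n);
    names[nnames][n] = '\0';
    return nnames++;
}

void gettok(void)
{
    static const char ops[] = "+-*/()";
    const char *p;

    /* Skip white space, counting lines as they are passed. */
    while (*lex_input == ' ' || *lex_input == '\t' || *lex_input == '\n')
        if (*lex_input++ == '\n')
            lexline++;
    lexindex = 0;
    if (*lex_input == '\0') {
        lextype = T_EOF;
        return;
    }
    if (isalpha((unsigned char)*lex_input)) {
        const char *start = lex_input;
        while (isalnum((unsigned char)*lex_input))
            lex_input++;
        lextype = T_IDN;
        lexindex = intern(start, (size_t)(lex_input - start));
        return;
    }
    p = strchr(ops, *lex_input++);
    lextype = p ? T_PLUS + (int)(p - ops) : T_BAD;
}

/* Print a symbolic representation of a token. */
void ptoken(int type, int index)
{
    static const char *const text[] =
        { "?", "eof", "?", "+", "-", "*", "/", "(", ")" };
    if (type == T_IDN && index >= 0)
        printf("%s", names[index]);
    else if (type >= 0 && type <= T_RPAREN)
        printf("%s", text[type]);
    else
        printf("?");
}
```

Here lexindex is set to the identifier's position in the table, so
that two occurrences of the same name yield the same value; this is
one way of using it to distinguish between identifiers.
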

4. Performing Translations

As described so far, the parser performs only recognition; that
is, given an input string of terminal symbols, it will produce
error messages if the string is not in the language defined by
the grammar and do nothing otherwise. YACC is also capable of
producing tables for a parser which performs translations, for
example, the syntax analyzer of a compiler. The following
extension is made in order to support translation: the parser
associates with each terminal symbol (received from the lexical
routine) and each nonterminal symbol (resulting from a reduction)
a word (integer, pointer) called a translation element. The
translation element for a terminal symbol is produced by the
lexical routine; it is communicated to PARSE via the global
variable LEXINDEX. Typically, the translation element for a
terminal symbol is used to distinguish between different
identifiers and constants. The translation element for a
nonterminal symbol is obtained by calling a user-provided action
routine when a reduction is made which produces the nonterminal
symbol. This action routine is specified by following the
production rule in the grammar with the body of the routine,
enclosed in braces. The action routine may access the
translation elements associated with the symbols on the
right-hand-side of the production using the notation #n, where "n"
is the number of the symbol (e.g., #1 refers to the translation
element for the first symbol of the right-hand-side). The action
routine specifies the value for the left-hand-side by setting the
global variable VAL. A typical action routine in a parser which
produces tree representations is

        {val=node(node_type,#1,#2,#3);}

where node is a routine which constructs nodes of the tree and
node_type is a tag which indicates the type of the node. An
action routine may also specify a line-number to be associated
with the left-hand-side by setting the global variable LINE; the
line-numbers of the symbols on the right-hand-side are accessible
through the global variable PL (e.g., pl[3] refers to the
line-number of the third symbol on the right-hand-side).

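As an illustration of such a node routine, the following sketch
builds tree nodes whose fields are single words. It rests on
several assumptions: the type name word, the layout of struct
node, the tag N_ADD, and the function action_add are all
hypothetical, and the text does not say how the #n notation is
translated, so the array rhs merely stands in for it.

```c
#include <stdint.h>
#include <stdlib.h>

typedef intptr_t word;   /* one translation element: integer or pointer */

struct node {
    int type;            /* tag giving the type of the node */
    word kid[3];         /* translation elements of the operands */
};

/* Construct a tree node and return it as a translation element. */
word node(int type, word a, word b, word c)
{
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        abort();
    n->type = type;
    n->kid[0] = a;
    n->kid[1] = b;
    n->kid[2] = c;
    return (word)n;
}

word val;                /* value for the left-hand-side */
enum { N_ADD = 1 };      /* hypothetical node type tag */

/* Roughly what the action {val=node(N_ADD,#1,#2,#3);} would do if
   #n were rewritten as rhs[n]; that rewriting is an assumption. */
void action_add(const word rhs[])
{
    val = node(N_ADD, rhs[1], rhs[2], rhs[3]);
}
```
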

5. Disambiguation

YACC is capable of disambiguating ambiguous grammars through the
use of precedence and associativity information. This is
especially useful in the case of arithmetic expressions since it
allows a much simpler grammar to be used. For example, the
grammar for expressions given above could be written:

        '+' '-' '*' '/' '(' ')' idn

        \< '+' '-'
        \< '*' '/'

        \\

        e: e '+' e
         | e '-' e
         | e '*' e
         | e '/' e
         | idn
         | '(' e ')'

The two lines following the list of terminal symbols create two
levels of precedence in increasing order and assign those levels
to the terminal symbols appearing on those lines. The '\<' which
begins a new precedence level also indicates left-association.
One may also specify '\>' for right-association and '\2'
indicating that association is not permitted (is to be regarded
as a syntax error). This last feature may be used to prohibit
the misleading association of operators such as comparison
operators.

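The text does not spell out how this information is applied when a
shift/reduce conflict arises. The sketch below shows the rule
conventionally used by yacc-style generators, offered as an
assumption about this version rather than a description of it; all
of the names are hypothetical.

```c
enum assoc { A_LEFT, A_RIGHT, A_NONE };      /* '\<', '\>', '\2' */
enum decision { D_SHIFT, D_REDUCE, D_ERROR };

/* rule_prec is the precedence level of the operator in the rule
   that could be reduced; tok_prec and tok_assoc describe the
   current input symbol.  Higher numbers bind more tightly. */
enum decision resolve_by_precedence(int rule_prec, int tok_prec,
                                    enum assoc tok_assoc)
{
    if (tok_prec > rule_prec)
        return D_SHIFT;          /* e.g. e '+' e . '*' */
    if (tok_prec < rule_prec)
        return D_REDUCE;         /* e.g. e '*' e . '+' */
    switch (tok_assoc) {         /* same level: use associativity */
    case A_LEFT:
        return D_REDUCE;
    case A_RIGHT:
        return D_SHIFT;
    default:
        return D_ERROR;          /* association not permitted */
    }
}
```

With '+' and '-' at level 1 and '*' and '/' at level 2, the input
a + b * c shifts the '*' after a + b, while a - b + c reduces a - b
first because both operators are left-associative.
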

6. The Operation of YACC

The operation of YACC is performed in five steps. First, the
input file is read and an internal representation of the grammar
is created. Second, certain auxiliary data structures are
constructed which contain information about the grammar which is
used by later steps. Third, the canonical LR(0) parser for the
grammar is constructed. Fourth, the LR(0) parser is analyzed by
computing and applying lookahead in order to resolve conflicts in
the LR(0) parser. Finally, a listing is written onto the OUTPUT
file containing the remaining conflicts in the parser, the
grammar, and the parser itself, and the tables are written onto
the TABLES file.


6.1 Constructing the Canonical LR(0) Parser

The canonical LR(0) parser for the grammar is constructed by the
following method: First, the grammar is augmented by adding a
production

        $accept: S -|

where the symbol $accept is a distinguished nonterminal added by
YACC, S represents the starting symbol of the original grammar,
and -| represents the end-of-file symbol. Second, the initial
state of the parser is created containing the item

        $accept -> . S -|

and its closure. The closure of a set of items I is defined to
be the smallest set of items C containing I such that if C
contains an item of the form

        A -> a . B b

for some nonterminal B and strings a and b, then C contains all
items of the form

        B -> . w

where w is the right-hand-side of a rule for B. The final step in
constructing the canonical LR(0) parser consists of constructing
the set of states accessible from the initial state. The set of
accessible states is defined to be the smallest set of states Q
containing the initial state such that for each state i in Q, if
j is the successor state of i on some symbol x, then j is in Q.
The successor state j of a state i on a symbol x is constructed
in two steps: First, for each item in state i of the form

        A -> a . x b

for nonterminal A and strings a and b, the item

        A -> a x . b

is added to state j. Second, the closure of the set of items in
state j is added to state j.

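The closure and successor operations defined above can be sketched
in C for the augmented expression grammar of section 2. The
representation (a table of flags indexed by rule and marker
position) and every name in the code are assumptions made for the
illustration; the data structures YACC itself uses are not
described here. Only the two operations are shown, not the loop
that collects all accessible states.

```c
#include <string.h>

/* Symbols: nonterminals first, then terminals (END stands for -|). */
enum { ACC, E, T, P, NNT,
       PLUS = NNT, MINUS, STAR, SLASH, LP, RP, IDN, END };

#define NRULES 9
#define MAXDOT 4                     /* marker positions 0..3 */

struct rule { int lhs, len, rhs[3]; };

static const struct rule rules[NRULES] = {
    { ACC, 2, { E, END } },          /* $accept: e -|  */
    { E, 3, { E, PLUS, T } },
    { E, 3, { E, MINUS, T } },
    { E, 1, { T } },
    { T, 3, { T, STAR, P } },
    { T, 3, { T, SLASH, P } },
    { T, 1, { P } },
    { P, 1, { IDN } },
    { P, 3, { LP, E, RP } },
};

/* has[r][d] is nonzero when the item "rule r with the marker
   before position d" belongs to the set. */
struct itemset { unsigned char has[NRULES][MAXDOT]; };

/* Extend c to its closure: whenever the marker stands before a
   nonterminal B, add every item B -> . w. */
void closure(struct itemset *c)
{
    int changed = 1, r, d, k;
    while (changed) {
        changed = 0;
        for (r = 0; r < NRULES; r++)
            for (d = 0; d < rules[r].len; d++) {
                int b = rules[r].rhs[d];   /* symbol after marker */
                if (!c->has[r][d] || b >= NNT)
                    continue;              /* absent, or terminal */
                for (k = 0; k < NRULES; k++)
                    if (rules[k].lhs == b && !c->has[k][0]) {
                        c->has[k][0] = 1;
                        changed = 1;
                    }
            }
    }
}

/* Successor of state i on symbol x: move the marker over x in
   every item where x follows it, then take the closure. */
struct itemset successor(const struct itemset *i, int x)
{
    struct itemset j;
    int r, d;
    memset(&j, 0, sizeof j);
    for (r = 0; r < NRULES; r++)
        for (d = 0; d < rules[r].len; d++)
            if (i->has[r][d] && rules[r].rhs[d] == x)
                j.has[r][d + 1] = 1;
    closure(&j);
    return j;
}

/* Initial state: the item $accept -> . e -| and its closure. */
struct itemset initial_state(void)
{
    struct itemset s;
    memset(&s, 0, sizeof s);
    s.has[0][0] = 1;
    closure(&s);
    return s;
}

int count_items(const struct itemset *s)
{
    int r, d, n = 0;
    for (r = 0; r < NRULES; r++)
        for (d = 0; d < MAXDOT; d++)
            n += s->has[r][d] != 0;
    return n;
}
```

An empty set returned by successor means that the state has no
successor on that symbol.
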

6.2 Applying Lookahead to the LR(0) Parser

The constructed LR(0) parser will generally contain conflicts,
that is, states in which more than one action is valid for some
input symbol. An item of the form

        A -> a .

is called a reduce item (reduction) since it indicates that the
entire right-hand-side of a rule has been recognized and can be
reduced to the left-hand-side. An item of the form

        A -> a . x b

where x is a terminal symbol, is called a shift item since it
indicates that if x is the current input symbol, then it should
be shifted onto the stack and control passed to the x-successor
state, which will contain the item

        A -> a x . b

If a state in the LR(0) parser contains a reduce item and one or
more shift items, or more than one reduce item, then the state
contains a conflict. Such conflicts may be resolved if it can be
determined that the reductions are valid only for certain input
symbols. In any state, if the sets of valid input symbols
("lookahead sets") for each reduction and the set of terminal
symbols for which successor states exist are pairwise disjoint,
then there is no conflict in that state, since the parser can
determine by looking at the current input symbol whether to shift
or to reduce, and which reduction to make.

In YACC, the lookahead sets are computed one terminal symbol at a
time; that is, for each terminal symbol, it is determined which
reductions are applicable (contain that terminal symbol in their
lookahead set). Then, each state of the LR(0) parser is checked
for conflicts on that terminal symbol. If there is more than
one applicable reduction, then a reduce/reduce conflict is
announced. If there is a successor state on that terminal symbol
and one or more applicable reductions, then a shift/reduce
conflict is announced.

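The per-terminal check described in this paragraph might be
sketched as follows. The bit-mask representation of lookahead sets
and the names state_info and conflicts_on are assumptions; the
text does not say how YACC stores this information.

```c
enum { C_NONE = 0, C_SHIFT_REDUCE = 1, C_REDUCE_REDUCE = 2 };

#define MAXRED 4    /* arbitrary limit for the illustration */

struct state_info {
    unsigned shift_on;           /* bit t set: successor exists on t */
    int nred;                    /* number of reduce items in state */
    unsigned lookahead[MAXRED];  /* lookahead set of each reduction */
};

/* Check one state for conflicts on terminal t and return a mask
   of the conflicts that would be announced. */
int conflicts_on(const struct state_info *s, int t)
{
    unsigned bit = 1u << t;
    int i, applicable = 0, found = C_NONE;

    for (i = 0; i < s->nred; i++)     /* count applicable reductions */
        if (s->lookahead[i] & bit)
            applicable++;
    if (applicable > 1)
        found |= C_REDUCE_REDUCE;
    if ((s->shift_on & bit) && applicable >= 1)
        found |= C_SHIFT_REDUCE;
    return found;
}
```

Both kinds of conflict can be reported for the same state and
terminal, which is why the result is a mask rather than a single
value.
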
@@ -341,6 +341,7 @@
- XGP, PDP-11 controller for the Xerox Graphics Printer.
- XGPSPL, spooler for the Xerox Graphics Printer.
- XXFILE, feed scripted input to a STY session.
- YACC, parser generator.
- YAHTZE, the game of Yahtzee.
- YOW, print Zippyisms.
- ZAP, dump TV bitmap as an XGP scan file.