mirror of
https://github.com/PDP-10/its.git
synced 2026-01-26 04:02:06 +00:00
YACC - parser generator.
Binary file originally from ES; TS YACC, timestamped 1978-05-21. The help file is from a TOPS-20 machine; timestamp 1981-08-25.
341
doc/c/oyacc.hlp
Normal file
@@ -0,0 +1,341 @@
YACC - Use and Operation


1. Introduction

This paper describes the use and operation of an LALR(1) parser
generator YACC (Yet Another Compiler-Compiler). YACC accepts as
input a BNF-like grammar and, if possible, produces as output a
set of tables for a table-driven shift-reduce parsing routine.
The parsing routine and the tables together form a parser which
recognizes the language defined by the grammar. The parser
generated by YACC can be used as the core of a syntax analyzer by
including in the grammar calls to user-provided action routines.
These calls are made by the parser at the appropriate points in
the analysis of the input string.

The class of LALR(1) grammars is a subclass of the class of LR(1)
grammars, those which can be parsed by a deterministic bottom-up
parser using one symbol of lookahead. The LALR(1) grammars are
those LR(1) grammars for which a parser can be constructed by a
relatively efficient process. Theoretically, all deterministic
context-free languages have an LR(1) grammar, but not necessarily
an LALR(1) grammar. Practically, however, it has been observed
that most common programming languages have "natural" grammars
which are easily converted to be LALR(1).

The original YACC was designed and implemented on a PDP-11/45 and
a Honeywell 6000 by S. C. Johnson at Bell Laboratories. The
version described in this paper was implemented on the PDP-10 by
Alan Snyder.

2. Using YACC

In the simplest case, the input to YACC is a file containing a
BNF-like grammar for the language. The grammar consists of a
sequence of rules, which have the following syntax:

        rule: lhs ':' rhs_list
        lhs: symbol
        rhs_list: rhs | rhs_list '|' rhs
        rhs: symbol_sequence
        symbol_sequence: symbol | symbol_sequence symbol

The above rules for rules are examples of rules. Another example
is the following simple grammar for expressions:

        e: e '+' t | e '-' t | t
        t: t '*' p | t '/' p | p
        p: idn | '(' e ')'

A symbol is any sequence of alphanumeric characters, including
underlines, dollar signs, and periods. In addition, a symbol may
be any sequence of characters enclosed in single quotes.
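
The definition of a symbol can be restated as a small C predicate.
This is only a sketch of the rule as worded above, not code taken
from YACC; the names is_symbol_char and is_symbol are hypothetical,
and it assumes that a quoted symbol holds at least one character
between the quotes, which the text does not settle.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns 1 if c may appear in an unquoted symbol: a letter, a digit,
   an underline, a dollar sign, or a period. */
static int is_symbol_char(int c)
{
    return isalnum(c) || c == '_' || c == '$' || c == '.';
}

/* Returns 1 if s is a symbol in the sense described above.
   Assumption: a quoted symbol needs at least one character between
   the quotes; the text does not say whether '' is allowed. */
int is_symbol(const char *s)
{
    size_t n = strlen(s);
    size_t i;

    if (n >= 3 && s[0] == '\'' && s[n - 1] == '\'')
        return 1;               /* quoted form: any characters inside */
    if (n == 0)
        return 0;
    for (i = 0; i < n; i++)
        if (!is_symbol_char((unsigned char)s[i]))
            return 0;
    return 1;
}
```

Under these assumptions rhs_list, $accept and '+' are accepted,
while a+b is rejected because the plus sign is not quoted.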

The symbols which appear as the left-hand-sides of rules are the
non-terminal symbols; all other symbols appearing in the grammar
are assumed to be terminal symbols. The symbol appearing as the
left-hand-side of the first rule is considered to be the start
symbol of the grammar.

After a file containing the grammar has been prepared, YACC may be
run. YACC will respond by asking for the name of the file containing
the grammar. After the file name is entered, YACC will analyze
the grammar and construct the parsing tables. YACC will print some
messages on the terminal to indicate its progress. When it has
finished, a listing will have been placed on the file YACC OUTPUT and
the parsing tables will have been written onto the file YACC TABLES.

In the process of constructing a parser for the grammar, YACC
may discover conflicts in the grammar. These conflicts indicate
that the grammar is not LALR(1). The conflicts, which are listed
in the OUTPUT file, may be of two types. The first type of
conflict is a shift/reduce conflict, abbreviated S/R. A
shift/reduce conflict indicates that, in the given state and with
the given input symbol, the constructed parser could legitimately
either shift the input symbol onto the stack or make an immediate
reduction. Shift/reduce conflicts are resolved by YACC in favor
of shifting. The second type of conflict is a reduce/reduce
conflict, abbreviated R/R. A reduce/reduce conflict indicates
that, in the given state and with the given input symbol, the
parser could legitimately make either of two reductions.
Reduce/reduce conflicts are resolved by YACC in favor of the
production appearing earlier in the input file.
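
The two default resolutions can be written out as a short C
function. This is a sketch under stated assumptions, not YACC's own
code: the types and the name resolve_conflict are hypothetical, and
rules are assumed to be numbered in the order in which they appear
in the input file.

```c
#include <assert.h>

enum action_kind { SHIFT, REDUCE };

struct action {
    enum action_kind kind;
    int rule;       /* rule number for REDUCE; unused for SHIFT */
};

/* Choose between two competing actions using the defaults stated above:
   shift beats reduce, and between two reductions the rule that appears
   earlier in the input file (assumed: the smaller rule number) wins. */
struct action resolve_conflict(struct action a, struct action b)
{
    if (a.kind == SHIFT)
        return a;
    if (b.kind == SHIFT)
        return b;
    return (a.rule <= b.rule) ? a : b;
}
```

A shift paired with any reduction yields the shift; two reductions
by rules 5 and 2 yield the reduction by rule 2.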

The relation of a conflict to a problem in the grammar can be
determined by examining the description of the particular state
in the action table section of the OUTPUT file. The first part
of the description is a set of items, where an item is a rule
which contains a marker ('.') in the right-hand-side. The marker
indicates how much of the right-hand-side has been seen by the
parser when the parser is in that state. Thus, the collection of
items represents the set of possibilities being considered by the
parser when in that state. A conflict indicates that the parser
cannot discard one of two possibilities on the basis of the
current input symbol, yet any action it takes will have the
effect of eliminating one of the two possibilities.


3. Interfacing with a Lexical Analyzer

The parsing tables produced by YACC are in the form of a C
program, ready to be compiled by the C compiler (CC). This C
program may be loaded together with the compiled version of a
parsing routine in order to construct a working parser for the
language. A standard parsing routine, called PARSE, may be found
in the file "<C>YPARSE.C".

PARSE assumes the existence of a lexical routine, called GETTOK,
which it can call in order to obtain the next terminal symbol
from the input stream. GETTOK is expected to set the values of
three integer global variables, LEXTYPE, LEXINDEX, and LEXLINE.
LEXTYPE should be set to an integer which distinguishes which
terminal symbol has been read. The correspondence between
integers and terminal symbols is listed in the OUTPUT file
produced by YACC. However, it is more convenient when an actual
parser is to be constructed to specify in the grammar the
correspondence between integers and terminal symbols. This is
done by listing at the beginning of the file the terminal symbols
of the grammar. They will be numbered consecutively, starting
with 3. (The integer 1 is to be returned by the lexical routine
to indicate the end of the input stream; the integer 2 is
reserved for an error recovery method.) The listing of terminal
symbols in the grammar should be separated from the list of rules
by the symbol '\\'. For example, the grammar

        '+' '-' '*' '/' '(' ')' idn

        \\

        e: e '+' t | e '-' t | t
        t: t '*' p | t '/' p | p
        p: idn | '(' e ')'

defines the following representations of terminal symbols:

        eof     1
        +       3
        -       4
        *       5
        /       6
        (       7
        )       8
        idn     9

The variable LEXLINE should be set to the line number in the
input file on which the terminal symbol being returned appeared;
this value is used by PARSE when reporting syntax errors and is
made available to any action routines. The variable LEXINDEX is
used only when performing translations (see next section).

In addition, PARSE requires a routine PTOKEN which will print
some symbolic representation of a token; this routine is used
when reporting syntax errors.
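
A lexical routine for the example grammar might look roughly like
the following C sketch. Only the names GETTOK, PTOKEN, LEXTYPE,
LEXINDEX and LEXLINE and the token numbers come from the text. The
lowercase spelling, the argument lists, the helper set_input, the
string input source, and the choice to skip unrecognized characters
are all assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Token numbers as assigned by the example grammar: 1 is end of input,
   2 is reserved, and the listed terminals are numbered from 3. */
enum { T_EOF = 1, T_PLUS = 3, T_MINUS, T_STAR, T_SLASH,
       T_LPAREN, T_RPAREN, T_IDN };

/* Globals read by PARSE; the lowercase spelling is an assumption. */
int lextype, lexindex, lexline = 1;

/* Hypothetical input source: a string scanned left to right. */
static const char *src = "";

void set_input(const char *s)
{
    src = s;
    lexline = 1;
}

/* Read the next terminal symbol and describe it in the globals. */
void gettok(void)
{
    lexindex = 0;                       /* no translation in this sketch */
    for (;;) {
        int c = (unsigned char)*src;

        if (c == '\0') { lextype = T_EOF; return; }
        src++;
        if (c == '\n') { lexline++; continue; }
        if (isalpha(c)) {               /* identifier: letter, then letters/digits */
            while (isalnum((unsigned char)*src))
                src++;
            lextype = T_IDN;
            return;
        }
        switch (c) {
        case '+': lextype = T_PLUS;   return;
        case '-': lextype = T_MINUS;  return;
        case '*': lextype = T_STAR;   return;
        case '/': lextype = T_SLASH;  return;
        case '(': lextype = T_LPAREN; return;
        case ')': lextype = T_RPAREN; return;
        default:  continue;   /* blanks and, by assumption, unknown characters */
        }
    }
}

/* Print a symbolic form of a token for error messages.  The argument
   list is an assumption; the text only names the routine. */
void ptoken(int type)
{
    static const char *const name[] = {
        "?", "eof", "?", "+", "-", "*", "/", "(", ")", "idn"
    };
    fputs(type >= 1 && type <= 9 ? name[type] : "?", stdout);
}
```

Once the end of the string is reached, every further call to gettok
reports token 1 again, so the parser may safely ask for end of input
more than once.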


4. Performing Translations

As described so far, the parser performs only recognition; that
is, given an input string of terminal symbols, it will produce
error messages if the string is not in the language defined by
the grammar and do nothing otherwise. YACC is capable also of
producing tables for a parser which performs translations, for
example, the syntax analyzer of a compiler. The following
extension is made in order to support translation: the parser
associates with each terminal symbol (received from the lexical
routine) and each nonterminal symbol (resulting from a reduction)
a word (integer, pointer) called a translation element. The
translation element for a terminal symbol is produced by the
lexical routine; it is communicated to PARSE via the global
variable LEXINDEX. Typically, the translation element for a
terminal symbol is used to distinguish between different
identifiers and constants. The translation element for a
nonterminal symbol is obtained by calling a user-provided action
routine when a reduction is made which produces the nonterminal
symbol. This action routine is specified by following the
production rule in the grammar with the body of the routine,
enclosed in braces. The action routine may access the
translation elements associated with the symbols on the
right-hand-side of the production using the notation #n, where
"n" is the number of the symbol (i.e., #1 refers to the translation
element for the first symbol of the right-hand-side). The action
routine specifies the value for the left-hand-side by setting the
global variable VAL. A typical action routine in a parser which
produces tree representations is

        {val=node(node_type,#1,#2,#3);}

where node is a routine which constructs nodes of the tree and
node_type is a tag which indicates the type of the node. An
action routine may also specify a line-number to be associated
with the left-hand-side by setting the global variable LINE; the
line-numbers of the symbols on the right-hand-side are accessible
through the global variable PL (i.e., pl[3] refers to the
line-number of the third symbol on the right-hand-side).
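
The node routine is supplied by the user, so its form is not fixed
by YACC. The following C sketch shows one plausible shape for it,
together with a function that does what the action above would do
for the rule e: e '+' t. The type word, the layout of struct node,
the tag N_ADD and the function action_add are hypothetical; the
array rhs merely stands in for the #n notation, which YACC itself
translates.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef intptr_t word;      /* one translation element: integer or pointer */

struct node {
    int  type;              /* tag identifying the kind of node */
    word op[3];             /* translation elements of the operands */
};

/* Build a tree node and return it as a translation element. */
word node(int type, word a, word b, word c)
{
    struct node *n = malloc(sizeof *n);

    if (n == NULL)
        abort();
    n->type = type;
    n->op[0] = a;
    n->op[1] = b;
    n->op[2] = c;
    return (word)n;
}

enum { N_ADD = 1 };         /* hypothetical node_type tag */

word val;                   /* set by an action: value of the left-hand-side */

/* What {val=node(N_ADD,#1,#2,#3);} amounts to for the rule
   e: e '+' t, with rhs[1..3] standing in for #1..#3. */
void action_add(const word rhs[4])
{
    val = node(N_ADD, rhs[1], rhs[2], rhs[3]);
}
```

Storing the operands as words rather than as typed pointers mirrors
the statement that a translation element is a single word holding
either an integer or a pointer.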


5. Disambiguation

YACC is capable of disambiguating ambiguous grammars through the
use of precedence and associativity information. This is
especially useful in the case of arithmetic expressions since it
allows a much simpler grammar to be used. For example, the
grammar for expressions given above could be written:

        '+' '-' '*' '/' '(' ')' idn

        \< '+' '-'
        \< '*' '/'

        \\

        e: e '+' e
         | e '-' e
         | e '*' e
         | e '/' e
         | idn
         | '(' e ')'

The two lines following the list of terminal symbols create two
levels of precedence in increasing order and assign those levels
to the terminal symbols appearing on those lines. The '\<' which
begins a new precedence level also indicates left-association.
One may also specify '\>' for right-association and '\2'
indicating that association is not permitted (is to be regarded
as a syntax error). This last feature may be used to prohibit
the misleading association of operators such as comparison
operators.
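
The text does not spell out how YACC applies this information. The
C sketch below shows the conventional way precedence and
associativity are used to settle a shift/reduce conflict in
yacc-style generators, and it is offered only on the assumption that
this YACC behaves similarly. All names in it are hypothetical.

```c
#include <assert.h>

enum assoc  { LEFT, RIGHT, NONASSOC };      /* \<, \>, \2 */
enum choice { DO_SHIFT, DO_REDUCE, DO_ERROR };

struct prec {
    int level;              /* a later precedence line gives a higher level */
    enum assoc assoc;
};

/* Decide a shift/reduce conflict between a completed rule whose
   operator has precedence rule_op and the lookahead operator. */
enum choice by_precedence(struct prec rule_op, struct prec lookahead)
{
    if (rule_op.level > lookahead.level)
        return DO_REDUCE;   /* operator already seen binds tighter */
    if (rule_op.level < lookahead.level)
        return DO_SHIFT;    /* incoming operator binds tighter */
    switch (lookahead.assoc) {              /* same level: use associativity */
    case LEFT:  return DO_REDUCE;
    case RIGHT: return DO_SHIFT;
    default:    return DO_ERROR;
    }
}
```

With the two levels declared above, having seen e '+' e with '*' as
the next symbol gives a shift, while e '*' e followed by '+' gives a
reduction, so that idn + idn * idn groups as idn + (idn * idn).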


6. The Operation of YACC

The operation of YACC is performed in five steps. First, the
input file is read and an internal representation of the grammar
is created. Second, certain auxiliary data structures are
constructed which contain information about the grammar which is
used by later steps. Third, the canonical LR(0) parser for the
grammar is constructed. Fourth, the LR(0) parser is analyzed by
computing and applying lookahead in order to resolve conflicts in
the LR(0) parser. Finally, a listing is written onto the OUTPUT
file containing the remaining conflicts in the parser, the
grammar, and the parser itself, and the tables are written onto
the TABLES file.

6.1 Constructing the Canonical LR(0) Parser

The canonical LR(0) parser for the grammar is constructed by the
following method: First, the grammar is augmented by adding a
production

        $accept: S -|

where the symbol $accept is a distinguished nonterminal added by
YACC, S represents the starting symbol of the original grammar,
and -| represents the end-of-file symbol. Second, the initial
state of the parser is created containing the item

        $accept -> . S -|

and its closure. The closure of a set of items I is defined to
be the smallest set of items C containing I such that if C
contains an item of the form

        A -> a . B b

for some nonterminal B and strings a and b, then C contains all
items of the form

        B -> . w

where B: w is a rule of the grammar. The final step in
constructing the canonical LR(0) parser consists of constructing
the set of states accessible from the initial state. The set of
accessible states is defined to be the smallest set of states Q
containing the initial state such that for each state i in Q, if
j is the successor state of i on some symbol x, then j is in Q.
The successor state j of a state i on a symbol x is constructed
in two steps: First, for each item in state i of the form

        A -> a . x b

for nonterminal A and strings a and b, the item

        A -> a x . b

is added to state j. Second, the closure of the set of items in
state j is added to state j.
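
The closure and successor constructions can be tried out on the
expression grammar of section 2. The following C sketch is an
illustration only: the data layout and all names are hypothetical
and say nothing about how YACC stores its states, and it stops short
of enumerating the full set of accessible states.

```c
#include <assert.h>
#include <string.h>

enum { ACCEPT, E, T, P,                                /* nonterminals */
       PLUS, MINUS, STAR, SLASH, LP, RP, IDN, END };   /* terminals; END is -| */

#define NRULES 9
#define MAXRHS 3

struct rule { int lhs, len, rhs[MAXRHS]; };

/* The augmented expression grammar; rule 0 is $accept: e -| */
static const struct rule g[NRULES] = {
    { ACCEPT, 2, { E, END } },
    { E, 3, { E, PLUS, T } },  { E, 3, { E, MINUS, T } }, { E, 1, { T } },
    { T, 3, { T, STAR, P } },  { T, 3, { T, SLASH, P } }, { T, 1, { P } },
    { P, 1, { IDN } },         { P, 3, { LP, E, RP } },
};

/* A state is a set of items; item[r][d] is 1 when rule r with the
   marker before position d of its right-hand-side is in the set. */
struct state { char item[NRULES][MAXRHS + 1]; };

/* Add every item B -> . w required by the closure definition. */
void closure(struct state *s)
{
    int changed = 1, r, d, k;

    while (changed) {
        changed = 0;
        for (r = 0; r < NRULES; r++)
            for (d = 0; d < g[r].len; d++) {
                if (!s->item[r][d])
                    continue;
                for (k = 0; k < NRULES; k++)
                    if (g[k].lhs == g[r].rhs[d] && !s->item[k][0]) {
                        s->item[k][0] = 1;
                        changed = 1;
                    }
            }
    }
}

/* The initial state: the item $accept -> . e -| and its closure. */
struct state initial_state(void)
{
    struct state s;

    memset(&s, 0, sizeof s);
    s.item[0][0] = 1;
    closure(&s);
    return s;
}

/* Successor of state i on symbol x: move the marker over x in every
   item that allows it, then take the closure of the result. */
struct state successor(const struct state *i, int x)
{
    struct state j;
    int r, d;

    memset(&j, 0, sizeof j);
    for (r = 0; r < NRULES; r++)
        for (d = 0; d < g[r].len; d++)
            if (i->item[r][d] && g[r].rhs[d] == x)
                j.item[r][d + 1] = 1;
    closure(&j);
    return j;
}

/* Count the items in a state. */
int count_items(const struct state *s)
{
    int r, d, n = 0;

    for (r = 0; r < NRULES; r++)
        for (d = 0; d <= g[r].len; d++)
            n += s->item[r][d];
    return n;
}
```

The initial state of this grammar holds nine items, one for each
rule with the marker at the far left. Its successor on e holds the
three items in which the marker has moved past a leading e.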

6.2 Applying Lookahead to the LR(0) Parser

The constructed LR(0) parser will generally contain conflicts,
that is, states in which more than one action is valid for some
input symbol. An item of the form

        A -> a .

is called a reduce item (reduction) since it indicates that the
entire right-hand-side of a rule has been recognized and can be
reduced to the left-hand-side. An item of the form

        A -> a . x b

where x is a terminal symbol, is called a shift item since it
indicates that if x is the current input symbol, then it should
be shifted onto the stack and control passed to the x-successor
state, which will contain the item

        A -> a x . b

If a state in the LR(0) parser contains a reduce item and one or
more shift items, or more than one reduce item, then the state
contains a conflict. Such conflicts may be resolved if it can be
determined that the reductions are valid only for certain input
symbols. In any state, if the sets of valid input symbols
("lookahead sets") for each reduction and the set of terminal
symbols for which successor states exist are disjoint, then there
is no conflict in that state, since the parser can determine by
looking at the current input symbol whether to shift or to
reduce, and what reduction to make.

In YACC, the lookahead sets are computed one terminal symbol at a
time; that is, for each terminal symbol, it is determined which
reductions are applicable (contain that terminal symbol in their
lookahead set). Then, each state of the LR(0) parser is checked
for conflicts on that terminal symbol. If there is more than
one applicable reduction, then a reduce/reduce conflict is
announced. If there is a successor state on that terminal symbol
and one or more applicable reductions, then a shift/reduce
conflict is announced.
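
The per-terminal check described above can be sketched in C as
follows. The representation of a state (bit masks indexed by
terminal number, and at most four reduce items) and every name here
are assumptions chosen to keep the example short; they are not
YACC's data structures.

```c
#include <assert.h>

enum { NO_CONFLICT = 0, SR_CONFLICT = 1, RR_CONFLICT = 2 };

#define MAXRED 4            /* hypothetical limit on reduce items per state */

struct lr_state {
    unsigned shift_on;          /* bit t set: a successor exists on terminal t */
    int nred;                   /* number of reduce items in the state */
    unsigned lookahead[MAXRED]; /* bit t set: reduction k is valid on terminal t */
};

/* Check one state for conflicts on one terminal symbol t.  The result
   is a bit mask because both kinds of conflict can occur together. */
int conflicts_on(const struct lr_state *s, int t)
{
    int k, applicable = 0, found = NO_CONFLICT;

    for (k = 0; k < s->nred; k++)
        if ((s->lookahead[k] >> t) & 1u)
            applicable++;
    if (applicable > 1)
        found |= RR_CONFLICT;
    if (((s->shift_on >> t) & 1u) && applicable >= 1)
        found |= SR_CONFLICT;
    return found;
}
```

A state whose shift set and lookahead sets are pairwise disjoint
yields NO_CONFLICT for every terminal, which is the condition given
above for the state to be free of conflicts.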

@@ -341,6 +341,7 @@
- XGP, PDP-11 controller for the Xerox Graphics Printer.
- XGPSPL, spooler for the Xerox Graphics Printer.
- XXFILE, feed scripted input to a STY session.
- YACC, parser generator.
- YAHTZE, the game of Yahtzee.
- YOW, print Zippyisms.
- ZAP, dump TV bitmap as an XGP scan file.