Binary file originally from ES; TS YACC, timestamped 1978-05-21. The help file is from a TOPS-20 machine; timestamp 1981-08-25.
YACC - Use and Operation


1. Introduction

This paper describes the use and operation of an LALR(1) parser
generator, YACC (Yet Another Compiler-Compiler). YACC accepts as
input a BNF-like grammar and, if possible, produces as output a
set of tables for a table-driven shift-reduce parsing routine.
The parsing routine and the tables together form a parser which
recognizes the language defined by the grammar. The parser
generated by YACC can be used as the core of a syntax analyzer by
including in the grammar calls to user-provided action routines.
These calls are made by the parser at the appropriate points in
the analysis of the input string.

The class of LALR(1) grammars is a subclass of the class of LR(1)
grammars, those which can be parsed by a deterministic bottom-up
parser using one symbol of lookahead. The LALR(1) grammars are
those LR(1) grammars for which a parser can be constructed by a
relatively efficient process. Theoretically, all deterministic
context-free languages have an LR(1) grammar, but not necessarily
an LALR(1) grammar. Practically, however, it has been observed
that most common programming languages have "natural" grammars
which are easily converted to be LALR(1).

The original YACC was designed and implemented on a PDP-11/45 and
a Honeywell 6000 by S. C. Johnson at Bell Laboratories. The
version described in this paper was implemented on the PDP-10 by
Alan Snyder.


2. Using YACC

In the simplest case, the input to YACC is a file containing a
BNF-like grammar for the language. The grammar consists of a
sequence of rules, which have the following syntax:

        rule: lhs ':' rhs_list
        lhs: symbol
        rhs_list: rhs | rhs_list '|' rhs
        rhs: symbol_sequence
        symbol_sequence: symbol | symbol_sequence symbol

The above rules for rules are examples of rules. Another example
is the following simple grammar for expressions:

        e: e '+' t | e '-' t | t
        t: t '*' p | t '/' p | p
        p: idn | '(' e ')'

A symbol is any sequence of alphanumeric characters, including
underlines, dollar signs, and periods. In addition, a symbol may
be any sequence of characters enclosed in single quotes.

The symbols which appear as the left-hand-sides of rules are the
non-terminal symbols; all other symbols appearing in the grammar
are assumed to be terminal symbols. The symbol appearing as the
left-hand-side of the first rule is considered to be the start
symbol of the grammar.

After a file containing the grammar has been prepared, YACC may be
run. YACC will respond by asking for the name of the file containing
the grammar. After the file name is entered, YACC will analyze
the grammar and construct the parsing tables. YACC will print some
messages on the terminal to indicate its progress. When it has
finished, a listing will have been placed on the file YACC OUTPUT and
the parsing tables will have been written onto the file YACC TABLES.

In the process of constructing a parser for the grammar, YACC
may discover conflicts in the grammar. These conflicts indicate
that the grammar is not LALR(1). The conflicts, which are listed
in the OUTPUT file, may be of two types. The first type of
conflict is a shift/reduce conflict, abbreviated S/R. A
shift/reduce conflict indicates that, in the given state and with
the given input symbol, the constructed parser could legitimately
either shift the input symbol onto the stack or make an immediate
reduction. Shift/reduce conflicts are resolved by YACC in favor
of shifting. The second type of conflict is a reduce/reduce
conflict, abbreviated R/R. A reduce/reduce conflict indicates
that, in the given state and with the given input symbol, the
parser could legitimately make either of two reductions.
Reduce/reduce conflicts are resolved by YACC in favor of the
production appearing earlier in the input file.

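The two default resolutions just described can be sketched in C as
follows. This is only an illustration, not YACC's own code: the
names action_kind, action, and choose_action are hypothetical, and
the sketch assumes that rules are numbered in the order in which
they appear in the input file.

```c
#include <assert.h>

/* Hypothetical encoding of a parser action; not YACC's table format. */
enum action_kind { ACT_ERROR, ACT_SHIFT, ACT_REDUCE };

struct action {
    enum action_kind kind;
    int rule;               /* rule number for ACT_REDUCE, otherwise -1 */
};

/*
 * Choose one action for a given state and input symbol.
 *   can_shift  nonzero if the symbol could be shifted
 *   rules      numbers of the reductions that could be made; rules are
 *              assumed to be numbered in input-file order
 *   nrules     number of entries in rules
 */
struct action choose_action(int can_shift, const int *rules, int nrules)
{
    struct action a = { ACT_ERROR, -1 };
    int i, best;

    assert(nrules >= 0);
    if (can_shift) {
        /* Any shift/reduce conflict is resolved in favor of shifting. */
        a.kind = ACT_SHIFT;
        return a;
    }
    if (nrules == 0)
        return a;           /* no valid action: a syntax error */
    /* A reduce/reduce conflict goes to the earliest rule. */
    best = rules[0];
    for (i = 1; i < nrules; i++)
        if (rules[i] < best)
            best = rules[i];
    a.kind = ACT_REDUCE;
    a.rule = best;
    return a;
}
```

For example, with a possible shift and any number of possible
reductions the result is a shift; with no shift and possible
reductions by rules 5 and 2 the result is a reduction by rule 2.
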
The relation of a conflict to a problem in the grammar can be
determined by examining the description of the particular state
in the action table section of the OUTPUT file. The first part
of the description is a set of items, where an item is a rule
which contains a marker ('.') in the right-hand-side. The marker
indicates how much of the right-hand-side has been seen by the
parser when the parser is in that state. Thus, the collection of
items represents the set of possibilities being considered by the
parser when in that state. A conflict indicates that the parser
cannot discard one of two possibilities on the basis of the
current input symbol, yet any action it takes will have the
effect of eliminating one of the two possibilities.


3. Interfacing with a Lexical Analyzer

The parsing tables produced by YACC are in the form of a C
program, ready to be compiled by the C compiler (CC). This C
program may be loaded together with the compiled version of a
parsing routine in order to construct a working parser for the
language. A standard parsing routine, called PARSE, may be found
in the file "<C>YPARSE.C".

PARSE assumes the existence of a lexical routine, called GETTOK,
which it can call in order to obtain the next terminal symbol
from the input stream. GETTOK is expected to set the values of
three integer global variables, LEXTYPE, LEXINDEX, and LEXLINE.
LEXTYPE should be set to an integer which identifies the
terminal symbol that has been read. The correspondence between
integers and terminal symbols is listed in the OUTPUT file
produced by YACC. However, when an actual parser is to be
constructed, it is more convenient to specify in the grammar the
correspondence between integers and terminal symbols. This is
done by listing at the beginning of the file the terminal symbols
of the grammar. They will be numbered consecutively, starting
with 3. (The integer 1 is to be returned by the lexical routine
to indicate the end of the input stream; the integer 2 is
reserved for an error recovery method.) The listing of terminal
symbols in the grammar should be separated from the list of rules
by the symbol '\\'. For example, the grammar

        '+' '-' '*' '/' '(' ')' idn

        \\

        e: e '+' t | e '-' t | t
        t: t '*' p | t '/' p | p
        p: idn | '(' e ')'

defines the following representations of terminal symbols:

        eof     1
        +       3
        -       4
        *       5
        /       6
        (       7
        )       8
        idn     9

The variable LEXLINE should be set to the line number in the
input file on which the terminal symbol being returned appeared;
this value is used by PARSE when reporting syntax errors and is
made available to any action routines. The variable LEXINDEX is
used only when performing translations (see next section).

In addition, PARSE requires a routine PTOKEN which will print
some symbolic representation of a token; this routine is used
when reporting syntax errors.


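A lexical routine for the expression grammar above might look
roughly like the following C sketch. Only the three global
variables and the token numbers come from the text. The argument
lists of gettok and ptoken, the helper set_input, reading the
input from a string, and the use of an identifier's first letter
as its LEXINDEX value are all assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Token numbers from the table of terminal symbols above. */
enum { T_EOF = 1, T_PLUS = 3, T_MINUS, T_TIMES, T_DIVIDE,
       T_LPAREN, T_RPAREN, T_IDN };

int lextype, lexindex, lexline;     /* the globals named in the text */

static const char *src = "";        /* assumption: input held in a string */

void set_input(const char *text)    /* hypothetical helper */
{
    assert(text != NULL);
    src = text;
    lexline = 1;
}

/* Read the next terminal symbol and describe it in the globals. */
void gettok(void)
{
    static const char ops[] = "+-*/()";
    const char *p;

    for (;;) {
        /* Skip white space, counting lines as we go. */
        while (*src == ' ' || *src == '\t' || *src == '\n')
            if (*src++ == '\n')
                lexline++;
        lexindex = 0;
        if (*src == '\0') {
            lextype = T_EOF;
            return;
        }
        p = strchr(ops, *src);
        if (p != NULL) {
            /* ops is in the same order as the token numbers. */
            lextype = T_PLUS + (int)(p - ops);
            src++;
            return;
        }
        if (isalpha((unsigned char)*src)) {
            lexindex = *src;        /* toy translation element */
            while (isalnum((unsigned char)*src))
                src++;
            lextype = T_IDN;
            return;
        }
        fprintf(stderr, "line %d: bad character '%c'\n", lexline, *src);
        src++;                      /* skip it and try again */
    }
}

/* Print a symbolic name for a token number (argument list assumed). */
void ptoken(int type)
{
    static const char *const names[] =
        { "?", "eof", "?", "+", "-", "*", "/", "(", ")", "idn" };

    fputs(type >= 1 && type <= 9 ? names[type] : "?", stdout);
}
```

A caller would invoke set_input once and then call gettok
repeatedly until LEXTYPE is 1. A real lexical routine would more
likely read from a file and would set LEXINDEX to something such
as a symbol table index.
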
4. Performing Translations

As described so far, the parser performs only recognition; that
is, given an input string of terminal symbols, it will produce
error messages if the string is not in the language defined by
the grammar and do nothing otherwise. YACC is capable also of
producing tables for a parser which performs translations, for
example, the syntax analyzer of a compiler. The following
extension is made in order to support translation: the parser
associates with each terminal symbol (received from the lexical
routine) and each nonterminal symbol (resulting from a reduction)
a word (integer, pointer) called a translation element. The
translation element for a terminal symbol is produced by the
lexical routine; it is communicated to PARSE via the global
variable LEXINDEX. Typically, the translation element for a
terminal symbol is used to distinguish between different
identifiers and constants. The translation element for a
nonterminal symbol is obtained by calling a user-provided action
routine when a reduction is made which produces the nonterminal
symbol. This action routine is specified by following the
production rule in the grammar with the body of the routine,
enclosed in braces. The action routine may access the
translation elements associated with the symbols on the
right-hand-side of the production using the notation #n, where
"n" is the number of the symbol (i.e., #1 refers to the
translation element for the first symbol of the right-hand-side).
The action routine specifies the value for the left-hand-side by
setting the global variable VAL. A typical action routine in a
parser which produces tree representations is

        {val=node(node_type,#1,#2,#3);}

where node is a routine which constructs nodes of the tree and
node_type is a tag which indicates the type of the node. An
action routine may also specify a line-number to be associated
with the left-hand-side by setting the global variable LINE; the
line-numbers of the symbols on the right-hand-side are accessible
through the global variable PL (i.e., pl[3] refers to the
line-number of the third symbol on the right-hand-side).


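The routine node is supplied by the user, so its definition is not
fixed by YACC. One possible shape is sketched below in C. The
type names word and tree, the tag N_ADD, and the use of intptr_t
to stand for a one-word translation element are assumptions made
for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* A translation element is one word holding an integer or a pointer;
   intptr_t is this sketch's stand-in for such a word. */
typedef intptr_t word;

struct tree {
    int type;           /* tag saying what kind of node this is */
    word kid[3];        /* translation elements of the right-hand-side */
};

enum { N_ADD = 1 };     /* hypothetical tag for  e: e '+' t  */

/* Build a tree node from the translation elements #1, #2 and #3. */
word node(int type, word a, word b, word c)
{
    struct tree *t = malloc(sizeof *t);

    assert(t != NULL);  /* a real parser would report the failure */
    t->type = type;
    t->kid[0] = a;
    t->kid[1] = b;
    t->kid[2] = c;
    return (word)t;
}
```

With such a routine, the action {val=node(N_ADD,#1,#2,#3);} on
the rule e: e '+' t would make the translation element of e a
pointer to a node whose first and third children are the
translation elements of the two operands.
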
5. Disambiguation

YACC is capable of disambiguating ambiguous grammars through the
use of precedence and associativity information. This is
especially useful in the case of arithmetic expressions since it
allows a much simpler grammar to be used. For example, the
grammar for expressions given above could be written:

        '+' '-' '*' '/' '(' ')' idn

        \< '+' '-'
        \< '*' '/'

        \\

        e: e '+' e
         | e '-' e
         | e '*' e
         | e '/' e
         | idn
         | '(' e ')'

The two lines following the list of terminal symbols create two
levels of precedence in increasing order and assign those levels
to the terminal symbols appearing on those lines. The '\<' which
begins a new precedence level also indicates left-association.
One may also specify '\>' for right-association and '\2'
indicating that association is not permitted (is to be regarded
as a syntax error). This last feature may be used to prohibit
the misleading association of operators such as comparison
operators.


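The text does not say exactly how the precedence levels are
applied when a conflict arises. The C sketch below shows the rule
conventionally used by yacc-style generators, and it is an
assumption that this version behaves the same way: the level of
the operator in the rule to be reduced is compared with the level
of the input symbol, and association decides ties. All names in
the sketch are hypothetical.

```c
#include <assert.h>

enum assoc { A_LEFT, A_RIGHT, A_NONE };     /* '\<', '\>' and '\2' */
enum choice { P_SHIFT, P_REDUCE, P_ERROR };

struct prec {
    int level;          /* a higher level binds more tightly */
    enum assoc assoc;
};

/*
 * Decide a shift/reduce conflict between a completed rule whose
 * operator has precedence rule_op and an input symbol with
 * precedence next_op (conventional rule, assumed here).
 */
enum choice disambiguate(struct prec rule_op, struct prec next_op)
{
    assert(rule_op.level > 0 && next_op.level > 0);
    if (next_op.level > rule_op.level)
        return P_SHIFT;         /* input operator binds tighter */
    if (next_op.level < rule_op.level)
        return P_REDUCE;        /* completed rule binds tighter */
    switch (rule_op.assoc) {    /* same level: association decides */
    case A_LEFT:
        return P_REDUCE;
    case A_RIGHT:
        return P_SHIFT;
    default:
        return P_ERROR;         /* association not permitted */
    }
}
```

Under this rule, with '+' at level 1 and '*' at level 2, both
left-associative, the parser holding e '+' e and looking at '*'
shifts, while the parser holding e '+' e and looking at '+'
reduces.
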
6. The Operation of YACC

The operation of YACC is performed in five steps. First, the
input file is read and an internal representation of the grammar
is created. Second, certain auxiliary data structures are
constructed which contain information about the grammar which is
used by later steps. Third, the canonical LR(0) parser for the
grammar is constructed. Fourth, the LR(0) parser is analyzed by
computing and applying lookahead in order to resolve conflicts in
the LR(0) parser. Finally, a listing is written onto the OUTPUT
file containing the remaining conflicts in the parser, the
grammar, and the parser itself, and the tables are written onto
the TABLES file.

6.1 Constructing the Canonical LR(0) Parser

The canonical LR(0) parser for the grammar is constructed by the
following method: First, the grammar is augmented by adding a
production

        $accept: S -|

where the symbol $accept is a distinguished nonterminal added by
YACC, S represents the starting symbol of the original grammar,
and -| represents the end-of-file symbol. Second, the initial
state of the parser is created containing the item

        $accept -> . S -|

and its closure. The closure of a set of items I is defined to
be the smallest set of items C containing I such that if C
contains an item of the form

        A -> a . B b

for some nonterminal B and strings a and b, then C contains all
items of the form

        B -> . w

where w is the right-hand-side of a rule for B. The final step
in constructing the canonical LR(0) parser consists of
constructing the set of states accessible from the initial state.
The set of accessible states is defined to be the smallest set of
states Q containing the initial state such that for each state i
in Q, if j is the successor state of i on some symbol x, then j
is in Q. The successor state j of a state i on a symbol x is
constructed in two steps: First, for each item in state i of the
form

        A -> a . x b

for nonterminal A and strings a and b, the item

        A -> a x . b

is added to state j. Second, the closure of the set of items in
state j is added to state j.

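The closure and successor constructions defined above can be
sketched in C for the expression grammar of section 2. The data
layout (a fixed rule table and one flag per item) and all names
are assumptions chosen to keep the example short; they say
nothing about how YACC itself represents states.

```c
#include <assert.h>
#include <string.h>

/* Symbols: nonterminals first, so that sym < NNONTERM tests for one. */
enum { ACCEPT, E, T, P, NNONTERM,
       PLUS = NNONTERM, MINUS, TIMES, DIVIDE, LPAREN, RPAREN, IDN, END };

#define MAXRHS 3
#define NRULES 9

struct rule { int lhs, len, rhs[MAXRHS]; };

static const struct rule rules[NRULES] = {
    { ACCEPT, 2, { E, END } },              /* $accept: e -|    */
    { E, 3, { E, PLUS, T } },               /* e: e '+' t       */
    { E, 3, { E, MINUS, T } },              /*  | e '-' t       */
    { E, 1, { T } },                        /*  | t             */
    { T, 3, { T, TIMES, P } },              /* t: t '*' p       */
    { T, 3, { T, DIVIDE, P } },             /*  | t '/' p       */
    { T, 1, { P } },                        /*  | p             */
    { P, 1, { IDN } },                      /* p: idn           */
    { P, 3, { LPAREN, E, RPAREN } },        /*  | '(' e ')'     */
};

/* in[r][d] is nonzero when the item with the marker before
   position d of rule r belongs to the set. */
struct itemset { char in[NRULES][MAXRHS + 1]; };

/* Extend *s to its closure; return the number of items in it. */
int closure(struct itemset *s)
{
    int r, d, q, changed, count = 0;

    assert(s != NULL);
    do {
        changed = 0;
        for (r = 0; r < NRULES; r++)
            for (d = 0; d < rules[r].len; d++) {
                if (!s->in[r][d] || rules[r].rhs[d] >= NNONTERM)
                    continue;
                /* Marker precedes nonterminal B: add each B -> . w */
                for (q = 0; q < NRULES; q++)
                    if (rules[q].lhs == rules[r].rhs[d] && !s->in[q][0]) {
                        s->in[q][0] = 1;
                        changed = 1;
                    }
            }
    } while (changed);
    for (r = 0; r < NRULES; r++)
        for (d = 0; d <= rules[r].len; d++)
            count += s->in[r][d] != 0;
    return count;
}

/* Build in *j the successor of state *i on symbol x.  Return the
   number of items in it, or 0 if there is no successor on x. */
int successor(const struct itemset *i, int x, struct itemset *j)
{
    int r, d, any = 0;

    memset(j, 0, sizeof *j);
    for (r = 0; r < NRULES; r++)
        for (d = 0; d < rules[r].len; d++)
            if (i->in[r][d] && rules[r].rhs[d] == x) {
                j->in[r][d + 1] = 1;    /* move the marker past x */
                any = 1;
            }
    return any ? closure(j) : 0;
}
```

Starting from a set that contains only the item for rule 0 with
the marker at position 0, closure produces the initial state.
Calling successor repeatedly on every symbol, and keeping each new
state not seen before, would then yield the set of accessible
states; that outer loop is omitted here.
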
6.2 Applying Lookahead to the LR(0) Parser

The constructed LR(0) parser will generally contain conflicts,
that is, states in which more than one action is valid for some
input symbol. An item of the form

        A -> a .

is called a reduce item (reduction) since it indicates that the
entire right-hand-side of a rule has been recognized and can be
reduced to the left-hand-side. An item of the form

        A -> a . x b

where x is a terminal symbol, is called a shift item since it
indicates that if x is the current input symbol, then it should
be shifted onto the stack and control passed to the x-successor
state, which will contain the item

        A -> a x . b

If a state in the LR(0) parser contains a reduce item and one or
more shift items, or more than one reduce item, then the state
contains a conflict. Such conflicts may be resolved if it can be
determined that the reductions are valid only for certain input
symbols. In any state, if the sets of valid input symbols
("lookahead sets") for each reduction and the set of terminal
symbols for which successor states exist are disjoint, then there
is no conflict in that state, since the parser can determine by
looking at the current input symbol whether to shift or to
reduce, and what reduction to make.

In YACC, the lookahead sets are computed one terminal symbol at a
time; that is, for each terminal symbol, it is determined which
reductions are applicable (contain that terminal symbol in their
lookahead set). Then, each state of the LR(0) parser is checked
for conflicts on that terminal symbol. If there is more than
one applicable reduction, then a reduce/reduce conflict is
announced. If there is a successor state on that terminal symbol
and one or more applicable reductions, then a shift/reduce
conflict is announced.
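
The per-terminal check described above can be sketched in C as
follows. The sketch assumes that the lookahead sets have already
been computed and stores them as bit masks indexed by token
number; this representation, the limit MAXRED, and all names are
assumptions for illustration, not YACC's own data structures.

```c
#include <assert.h>

#define MAXRED 8        /* hypothetical limit on reductions per state */

struct state {
    unsigned long shifts;               /* bit t set: a successor state
                                           exists on terminal t */
    int nred;                           /* number of reduce items */
    unsigned long lookahead[MAXRED];    /* bit t set: that reduction is
                                           applicable on terminal t */
};

struct tally { int sr, rr; };           /* conflicts announced */

/* Check one state for conflicts, one terminal symbol at a time. */
struct tally check_state(const struct state *s, int nterminals)
{
    struct tally n = { 0, 0 };
    int t, r, applicable;

    assert(s->nred >= 0 && s->nred <= MAXRED);
    assert(nterminals >= 0 && nterminals <= 32);
    for (t = 0; t < nterminals; t++) {
        applicable = 0;
        for (r = 0; r < s->nred; r++)
            if ((s->lookahead[r] >> t) & 1ul)
                applicable++;
        if (applicable > 1)
            n.rr++;                     /* reduce/reduce on t */
        if (applicable >= 1 && ((s->shifts >> t) & 1ul))
            n.sr++;                     /* shift/reduce on t */
    }
    return n;
}
```

For instance, a state with a successor on '+' (token 3) and one
reduction whose lookahead set contains both the end of input
(token 1) and '+' gives one shift/reduce conflict, which is the
situation that arises with the ambiguous rule e: e '+' e.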