mirror of https://github.com/PDP-10/its.git synced 2026-01-13 15:27:28 +00:00
PDP-10.its/doc/c/oyacc.hlp
Lars Brinkhoff 08f25f90db YACC - parser generator.
Binary file originally from ES; TS YACC, timestamped 1978-05-21.  The
help file is from a TOPS-20 machine; timestamp 1981-08-25.
2019-01-28 17:38:47 +01:00

YACC - Use and Operation
1. Introduction
This paper describes the use and operation of a LALR(1) parser
generator YACC (Yet Another Compiler-Compiler). YACC accepts as
input a BNF-like grammar and, if possible, produces as output a
set of tables for a table-driven shift-reduce parsing routine.
The parsing routine and the tables together form a parser which
recognizes the language defined by the grammar. The parser
generated by YACC can be used as the core of a syntax analyzer by
including in the grammar calls to user-provided action routines.
These calls are made by the parser at the appropriate points in
the analysis of the input string.
The class of LALR(1) grammars is a subclass of the class of LR(1)
grammars, those which can be parsed by a deterministic bottom-up
parser using one symbol of lookahead. The LALR(1) grammars are
those LR(1) grammars for which a parser can be constructed by a
relatively efficient process. Theoretically, all deterministic
context-free languages have an LR(1) grammar, but not necessarily
a LALR(1) grammar. Practically, however, it has been observed
that most common programming languages have "natural" grammars
which are easily converted to be LALR(1).
The original YACC was designed and implemented on a PDP-11/45 and
a Honeywell 6000 by S. C. Johnson at Bell Laboratories. The
version described in this paper was implemented on the PDP-10 by
Alan Snyder.
2. Using YACC
In the simplest case, the input to YACC is a file containing a
BNF-like grammar for the language. The grammar consists of a
sequence of rules, which have the following syntax:
    rule: lhs ':' rhs_list
    lhs: symbol
    rhs_list: rhs | rhs_list '|' rhs
    rhs: symbol_sequence
    symbol_sequence: symbol | symbol_sequence symbol
The above rules for rules are examples of rules. Another example
is the following simple grammar for expressions:
    e: e '+' t | e '-' t | t
    t: t '*' p | t '/' p | p
    p: idn | '(' e ')'
A symbol is any sequence of alphanumeric characters, including
underlines, dollar signs, and periods. In addition, a symbol may
be any sequence of characters enclosed in single quotes.
The symbols which appear as the left-hand-sides of rules are the
non-terminal symbols; all other symbols appearing in the grammar
are assumed to be terminal symbols. The symbol appearing as the
left-hand-side of the first rule is considered to be the start
symbol of the grammar.
After a file containing the grammar has been prepared, YACC may be
run. YACC will respond by asking for the name of the file containing
the grammar. After the file name is entered, YACC will analyze
the grammar and construct the parsing tables. YACC will print some
messages on the terminal to indicate its progress. When it has
finished, a listing will have been placed on the file YACC OUTPUT and
the parsing tables will have been written onto the file YACC TABLES.
In the process of constructing a parser for the grammar, YACC
may discover conflicts in the grammar. These conflicts indicate
that the grammar is not LALR(1). The conflicts, which are listed
in the OUTPUT file, may be of two types. The first type of
conflict is a shift/reduce conflict, abbreviated S/R. A
shift/reduce conflict indicates that, in the given state and with
the given input symbol, the constructed parser could legitimately
either shift the input symbol onto the stack or make an immediate
reduction. Shift/reduce conflicts are resolved by YACC in favor
of shifting. The second type of conflict is a reduce/reduce
conflict, abbreviated R/R. A reduce/reduce conflict indicates
that, in the given state and with the given input symbol, the
parser could legitimately make either of two reductions.
Reduce/reduce conflicts are resolved by YACC in favor of the
production appearing earlier in the input file.
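The two default resolution rules can be sketched in C as follows. This is only an illustration of the rules as stated above, not code taken from YACC; the function name choose_action, its return convention, and the numbering of rules from 1 in order of appearance are assumptions made for the example.

```c
/* Pick the parser action for one state and one input symbol.
   can_shift  - nonzero if a shift on the input symbol is possible
   reductions - numbers of the rules that could be reduced (assumed to
                be numbered from 1 in order of appearance in the grammar)
   n          - how many entries reductions[] holds
   Returns 0 for shift, a rule number for reduce, -1 if there is no action. */
int choose_action(int can_shift, const int *reductions, int n)
{
    int best = -1;
    int i;

    if (can_shift)
        return 0;               /* shift/reduce: shifting wins */
    for (i = 0; i < n; i++)     /* reduce/reduce: earliest rule wins */
        if (best < 0 || reductions[i] < best)
            best = reductions[i];
    return best;
}
```

With a possible shift and the reductions 4 and 2, the sketch returns 0 (shift); without the shift it returns 2, the rule that appears earlier.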
The relation of a conflict to a problem in the grammar can be
determined by examining the description of the particular state
in the action table section of the OUTPUT file. The first part
of the description is a set of items, where an item is a rule
which contains a marker ('.') in the right-hand-side. The marker
indicates how much of the right-hand-side has been seen by the
parser when the parser is in that state. Thus, the collection of
items represents the set of possibilities being considered by the
parser when in that state. A conflict indicates that the parser
cannot discard one of two possibilities on the basis of the
current input symbol, yet any action it takes will have the
effect of eliminating one of the two possibilities.
3. Interfacing with a Lexical Analyzer
The parsing tables produced by YACC are in the form of a C
program, ready to be compiled by the C compiler (CC). This C
program may be loaded together with the compiled version of a
parsing routine in order to construct a working parser for the
language. A standard parsing routine, called PARSE, may be found
in the file "<C>YPARSE.C".
PARSE assumes the existence of a lexical routine, called GETTOK,
which it can call in order to obtain the next terminal symbol
from the input stream. GETTOK is expected to set the values of
three integer global variables, LEXTYPE, LEXINDEX, and LEXLINE.
LEXTYPE should be set to an integer which distinguishes which
terminal symbol has been read. The correspondence between
integers and terminal symbols is listed in the OUTPUT file
produced by YACC. However, it is more convenient when an actual
parser is to be constructed to specify in the grammar the
correspondence between integers and terminal symbols. This is
done by listing at the beginning of the file the terminal symbols
of the grammar. They will be numbered consecutively, starting
with 3. (The integer 1 is to be returned by the lexical routine
to indicate the end of the input stream; the integer 2 is
reserved for an error recovery method.) The listing of terminal
symbols in the grammar should be separated from the list of rules
by the symbol '\\'. For example, the grammar
    '+' '-' '*' '/' '(' ')' idn
    \\
    e: e '+' t | e '-' t | t
    t: t '*' p | t '/' p | p
    p: idn | '(' e ')'
defines the following representations of terminal symbols:
    eof   1
    +     3
    -     4
    *     5
    /     6
    (     7
    )     8
    idn   9
The variable LEXLINE should be set to the line number in the
input file on which the terminal symbol being returned appeared;
this value is used by PARSE when reporting syntax errors and is
made available to any action routines. The variable LEXINDEX is
used only when performing translations (see next section).
In addition, PARSE requires a routine PTOKEN which will print
some symbolic representation of a token; this routine is used
when reporting syntax errors.
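A lexical routine for the example grammar might look like the following sketch. It is an illustration under several assumptions: the lower-case C spellings gettok, lextype, lexindex, and lexline; the input pointer src; and the identifier table behind intern are all hypothetical and are not taken from YPARSE.C. Unrecognized characters are simply skipped.

```c
#include <ctype.h>
#include <string.h>

/* Token numbers from the table above (1 = end of input, 2 is reserved). */
enum { T_EOF = 1, T_PLUS = 3, T_MINUS, T_STAR, T_SLASH, T_LPAR, T_RPAR, T_IDN };

/* Globals read by the parsing routine; the lower-case names are assumed. */
int lextype, lexindex, lexline = 1;

/* Hypothetical input source and identifier table, for illustration only. */
const char *src = "";
static char names[64][16];
static int nnames;

/* Return the table index of the identifier s[0..n-1], adding it if new. */
static int intern(const char *s, int n)
{
    int i;

    for (i = 0; i < nnames; i++)
        if ((int)strlen(names[i]) == n && strncmp(names[i], s, n) == 0)
            return i;
    if (nnames == 64)
        return 63;              /* table full: reuse the last slot */
    memcpy(names[nnames], s, n);
    names[nnames][n] = '\0';
    return nnames++;
}

/* Read the next terminal symbol and describe it in the globals. */
void gettok(void)
{
    static const char ops[] = "+-*/()";

    for (;;) {
        int c = (unsigned char)*src;
        const char *p;

        if (c == '\0') { lextype = T_EOF; return; }
        if (c == '\n') lexline++;
        if (isalpha(c)) {
            int n = 0;
            while (isalnum((unsigned char)src[n])) n++;
            lextype = T_IDN;
            lexindex = intern(src, n > 15 ? 15 : n);  /* names cut at 15 */
            src += n;
            return;
        }
        src++;
        if ((p = strchr(ops, c)) != NULL) {
            lextype = T_PLUS + (int)(p - ops);  /* ops[] follows token order */
            return;
        }
        /* blanks, newlines, and unknown characters are skipped */
    }
}
```

Each call advances src past one token. For an identifier, lexindex holds its position in the name table, so two occurrences of the same name yield the same translation element.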
4. Performing Translations
As described so far, the parser performs only recognition; that
is, given an input string of terminal symbols, it will produce
error messages if the string is not in the language defined by
the grammar and do nothing otherwise. YACC is capable also of
producing tables for a parser which performs translations, for
example, the syntax analyzer of a compiler. The following
extension is made in order to support translation: the parser
associates with each terminal symbol (received from the lexical
routine) and each nonterminal symbol (resulting from a reduction)
a word (integer, pointer) called a translation element. The
translation element for a terminal symbol is produced by the
lexical routine; it is communicated to PARSE via the global
variable LEXINDEX. Typically, the translation element for a
terminal symbol is used to distinguish between different
identifiers and constants. The translation element for a
nonterminal symbol is obtained by calling a user-provided action
routine when a reduction is made which produces the nonterminal
symbol. This action routine is specified by following the
production rule in the grammar with the body of the routine,
enclosed in braces. The action routine may access the
translation elements associated with the symbols on the right-
hand-side of the production using the notation #n, where "n" is
the number of the symbol (e.g., #1 refers to the translation
element for the first symbol of the right-hand-side). The action
routine specifies the value for the left-hand-side by setting the
global variable VAL. A typical action routine in a parser which
produces tree representations is
    {val=node(node_type,#1,#2,#3);}
where node is a routine which constructs nodes of the tree and
node_type is a tag which indicates the type of the node. An
action routine may also specify a line-number to be associated
with the left-hand-side by setting the global variable LINE; the
line-numbers of the symbols on the right-hand-side are accessible
through the global variable PL (e.g., pl[3] refers to the
line-number of the third symbol on the right-hand-side).
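The routine node used in the example action is supplied by the user. One possible shape is sketched below, assuming that a translation element is an int and that a tree node is identified by its index in a fixed pool, so that it fits in one word. The tags N_IDN, N_ADD, and N_MUL, the pool, and the helper act_add are hypothetical. How PARSE stores the translation elements denoted by #1, #2, and #3 is not described here, so the sketch passes them in an array.

```c
/* Hypothetical node_type tags. */
enum { N_IDN = 1, N_ADD, N_MUL };

struct node { int type, a, b, c; };

struct node pool[256];
int npool = 1;                  /* index 0 is kept to mean "no node" */

int val;                        /* value for the left-hand-side */

/* Build a tree node and return its index in the pool. */
int node(int type, int a, int b, int c)
{
    if (npool == 256)
        return 0;               /* pool exhausted */
    pool[npool].type = type;
    pool[npool].a = a;
    pool[npool].b = b;
    pool[npool].c = c;
    return npool++;
}

/* What the action {val=node(N_ADD,#1,#2,#3);} amounts to once #1..#3
   have been replaced; tr[1]..tr[3] stand in for the translation
   elements of the three right-hand-side symbols (tr[0] is unused). */
void act_add(const int *tr)
{
    val = node(N_ADD, tr[1], tr[2], tr[3]);
}
```

After the reduction, val holds the index of the new node, which becomes the translation element of the left-hand-side.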
5. Disambiguation
YACC is capable of disambiguating ambiguous grammars through the
use of precedence and associativity information. This is
especially useful in the case of arithmetic expressions since it
allows a much simpler grammar to be used. For example, the
grammar for expressions given above could be written:
    '+' '-' '*' '/' '(' ')' idn
    \< '+' '-'
    \< '*' '/'
    \\
    e: e '+' e
     | e '-' e
     | e '*' e
     | e '/' e
     | idn
     | '(' e ')'
The two lines following the list of terminal symbols create two
levels of precedence in increasing order and assign those levels
to the terminal symbols appearing on those lines. The '\<' which
begins a new precedence level also indicates left-association.
One may also specify '\>' for right-association and '\2'
indicating that association is not permitted (is to be regarded
as a syntax error). This last feature may be used to prohibit
the misleading association of operators such as comparison
operators.
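This paper does not spell out how the precedence and associativity information is applied. The sketch below shows the method commonly used by YACC-style generators to settle a shift/reduce conflict, and it is offered as an assumption about the mechanism rather than a description of this implementation. The names struct prec and resolve are hypothetical. The rule being reduced is assumed to take the precedence of its operator, and a larger level means tighter binding.

```c
enum assoc  { LEFT, RIGHT, NONASSOC };          /* \<, \>, \2 */
enum action { ACT_SHIFT, ACT_REDUCE, ACT_ERROR };

struct prec { int level; enum assoc assoc; };

/* Decide between reducing a rule whose operator has precedence rule_op
   and shifting an input symbol with precedence lookahead. */
enum action resolve(struct prec rule_op, struct prec lookahead)
{
    if (rule_op.level > lookahead.level)
        return ACT_REDUCE;      /* the rule binds tighter */
    if (rule_op.level < lookahead.level)
        return ACT_SHIFT;       /* the input operator binds tighter */
    switch (lookahead.assoc) {  /* same level: associativity decides */
    case LEFT:  return ACT_REDUCE;
    case RIGHT: return ACT_SHIFT;
    default:    return ACT_ERROR;
    }
}
```

With '+' at level 1 and '*' at level 2, both left-associative, the parser in the middle of "a + b * c" shifts the '*', while in "a + b + c" it reduces "a + b" first.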
6. The Operation of YACC
The operation of YACC is performed in five steps. First, the
input file is read and an internal representation of the grammar
is created. Second, certain auxiliary data structures are
constructed which contain information about the grammar which is
used by later steps. Third, the canonical LR(0) parser for the
grammar is constructed. Fourth, the LR(0) parser is analyzed by
computing and applying lookahead in order to resolve conflicts in
the LR(0) parser. Finally, a listing is written onto the OUTPUT
file containing the remaining conflicts in the parser, the
grammar, and the parser itself, and the tables are written onto
the TABLES file.
6.1 Constructing the Canonical LR(0) Parser
The canonical LR(0) parser for the grammar is constructed by the
following method: First, the grammar is augmented by adding a
production
    $accept: S -|
where the symbol $accept is a distinguished nonterminal added by
YACC, S represents the starting symbol of the original grammar,
and -| represents the end-of-file symbol. Second, the initial
state of the parser is created containing the item
    $accept -> . S -|
and its closure. The closure of a set of items I is defined to
be the smallest set of items C containing I such that if C
contains an item of the form
    A -> a . B b
for some nonterminal B and strings a and b, then C contains all
items of the form
    B -> . w
for each rule B -> w of the grammar. The final step in
constructing the canonical LR(0) parser consists of constructing
the set of states accessible from the initial state. The set of
accessible states is defined to be the smallest set of states Q
containing the initial state such that for each state i in Q, if
j is the successor state of i on some symbol x, then j is in Q.
The successor state j of a state i on a symbol x is constructed
in two steps: First, for each item in
state i of the form
    A -> a . x b
for nonterminal A and strings a and b, the item
    A -> a x . b
is added to state j. Second, the closure of the set of items in
state j is added to state j.
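The closure and successor constructions can be sketched in C for the expression grammar of section 2. This is a minimal illustration and not the data structure used by YACC: the encoding of an item as one bit of a 64-bit word, the symbol numbering, and the names closure, successor, and ITEM are all assumptions made for the example.

```c
#include <stdint.h>

/* Symbols: nonterminals first, then terminals. */
enum { ACCEPT, E, T, P, NNONTERM,
       END = NNONTERM, PLUS, MINUS, STAR, SLASH, LPAR, RPAR, IDN };

struct rule { int lhs, len, rhs[3]; };

/* The expression grammar, augmented with rule 0. */
static const struct rule rules[] = {
    { ACCEPT, 2, { E, END } },        /* 0: $accept: e -|  */
    { E, 3, { E, PLUS, T } },         /* 1: e: e '+' t     */
    { E, 3, { E, MINUS, T } },        /* 2: e: e '-' t     */
    { E, 1, { T } },                  /* 3: e: t           */
    { T, 3, { T, STAR, P } },         /* 4: t: t '*' p     */
    { T, 3, { T, SLASH, P } },        /* 5: t: t '/' p     */
    { T, 1, { P } },                  /* 6: t: p           */
    { P, 1, { IDN } },                /* 7: p: idn         */
    { P, 3, { LPAR, E, RPAR } },      /* 8: p: '(' e ')'   */
};
enum { NRULES = sizeof rules / sizeof rules[0], MAXDOT = 4 };

/* An item is rule r with the marker before position d; a set of items
   is a bit mask with one bit per (r, d) pair. */
typedef uint64_t itemset;
#define ITEM(r, d) ((itemset)1 << ((r) * MAXDOT + (d)))

/* Smallest superset of s that contains B -> . w for every rule B -> w
   whenever it contains an item with the marker just before B. */
itemset closure(itemset s)
{
    int changed = 1;

    while (changed) {
        changed = 0;
        for (int r = 0; r < NRULES; r++)
            for (int d = 0; d < rules[r].len; d++) {
                int b = rules[r].rhs[d];      /* symbol after the marker */
                if (!(s & ITEM(r, d)) || b >= NNONTERM)
                    continue;                 /* absent item, or terminal */
                for (int q = 0; q < NRULES; q++)
                    if (rules[q].lhs == b && !(s & ITEM(q, 0))) {
                        s |= ITEM(q, 0);
                        changed = 1;
                    }
            }
    }
    return s;
}

/* Successor of state s on symbol x: move the marker over x in every
   item that allows it, then take the closure of the result. */
itemset successor(itemset s, int x)
{
    itemset j = 0;

    for (int r = 0; r < NRULES; r++)
        for (int d = 0; d < rules[r].len; d++)
            if ((s & ITEM(r, d)) && rules[r].rhs[d] == x)
                j |= ITEM(r, d + 1);
    return closure(j);
}
```

Here closure(ITEM(0, 0)) gives the initial state, and an empty result from successor means that the state has no successor on that symbol. The set of accessible states would be obtained by applying successor repeatedly, starting from the initial state, until no new state appears.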
6.2 Applying Lookahead to the LR(0) Parser
The constructed LR(0) parser will generally contain conflicts,
that is, states in which more than one action is valid for some
input symbol. An item of the form
    A -> a .
is called a reduce item (reduction) since it indicates that the
entire right-hand-side of a rule has been recognized and can be
reduced to the left-hand-side. An item of the form
    A -> a . x b
where x is a terminal symbol, is called a shift item since it
indicates that if x is the current input symbol, then it should
be shifted onto the stack and control passed to the x-successor
state, which will contain the item
    A -> a x . b
If a state in the LR(0) parser contains a reduce item and one or
more shift items, or more than one reduce item, then the state
contains a conflict. Such conflicts may be resolved if it can be
determined that the reductions are valid only for certain input
symbols. In any state, if the sets of valid input symbols
("lookahead sets") for each reduction and the set of terminal
symbols for which successor states exist are disjoint, then there
is no conflict in that state, since the parser can determine by
looking at the current input symbol whether to shift or to
reduce, and what reduction to make.
In YACC, the lookahead sets are computed one terminal symbol at a
time; that is, for each terminal symbol, it is determined which
reductions are applicable (contain that terminal symbol in their
lookahead set). Then, each state of the LR(0) parser is checked
for conflicts on that terminal symbol. If more than one
reduction is applicable, then a reduce/reduce conflict is
announced. If there is a successor state on that terminal symbol
and one or more applicable reductions, then a shift/reduce
conflict is announced.
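The per-symbol conflict check can be sketched as follows. The representation is an assumption made for the example: each set of terminal symbols is a bit mask in which bit t stands for terminal number t, and the names conflicts_on, SHIFT_REDUCE, and REDUCE_REDUCE are hypothetical.

```c
enum { SHIFT_REDUCE = 1, REDUCE_REDUCE = 2 };

/* Check one state for conflicts on terminal symbol t.
   shifts - bit t is set if the state has a successor on terminal t
   look   - look[i] is the lookahead set of the i-th reduce item
   nred   - number of reduce items in the state
   Returns a combination of the flags above, or 0 if there is no conflict. */
int conflicts_on(int t, unsigned shifts, const unsigned *look, int nred)
{
    int applicable = 0;
    int flags = 0;

    for (int i = 0; i < nred; i++)
        if ((look[i] >> t) & 1u)
            applicable++;
    if (applicable > 1)
        flags |= REDUCE_REDUCE;
    if (applicable >= 1 && ((shifts >> t) & 1u))
        flags |= SHIFT_REDUCE;
    return flags;
}
```

For example, a state of the ambiguous grammar of section 5 that holds both "e -> e '+' e ." and "e -> e . '+' e" has a successor on '+' (terminal 3) and a reduction whose lookahead set also contains '+', so the sketch reports a shift/reduce conflict on that symbol and none on ')'.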