PDP-10.its/doc/c/oyacc.hlp

                    YACC - Use and Operation


1.  Introduction

This paper describes the use and operation of YACC (Yet Another
Compiler-Compiler), an LALR(1) parser generator.  YACC accepts as
input a BNF-like grammar  and, if possible, produces as  output a
set of  tables for a  table-driven shift-reduce  parsing routine.
The parsing routine and  the tables together form a  parser which
recognizes  the  language  defined by  the  grammar.   The parser
generated by YACC can be used as the core of a syntax analyzer by
including in the grammar calls to user-provided  action routines.
These calls are made by  the parser at the appropriate  points in
the analysis of the input string.

The class of LALR(1) grammars is a subclass of the class of LR(1)
grammars, those which can be parsed by a  deterministic bottom-up
parser using one symbol  of lookahead.  The LALR(1)  grammars are
those LR(1) grammars for which  a parser can be constructed  by a
relatively efficient  process.  Theoretically,  all deterministic
context-free languages have an LR(1) grammar, but not necessarily
an LALR(1) grammar.  Practically, however, it has been observed
that most  common programming  languages have  "natural" grammars
which are easily converted to be LALR(1).

The original YACC was designed and implemented on a PDP-11/45 and
a  Honeywell 6000  by S.  C. Johnson  at Bell  Laboratories.  The
version described in this paper was implemented on the  PDP-10 by
Alan Snyder.


2.  Using YACC

In the simplest  case, the input to  YACC is a file  containing a
BNF-like grammar  for the  language.  The  grammar consists  of a
sequence of rules, which have the following syntax:

         rule:               lhs ':' rhs_list
         lhs:                symbol
         rhs_list:           rhs | rhs_list '|' rhs
         rhs:                symbol_sequence
         symbol_sequence:    symbol | symbol_sequence symbol

The above rules for rules are examples of rules.  Another example
is the following simple grammar for expressions:

         e:        e '+' t  |  e '-' t  |  t
         t:        t '*' p  |  t '/' p  |  p
         p:        idn  |  '(' e ')'

A symbol  is any sequence  of alphanumeric  characters, including
underlines, dollar signs, and periods.  In addition, a symbol may
be any sequence of characters enclosed in single quotes.
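
The definition of a symbol can be restated as a small C routine.
This is only a sketch: the name is_symbol is hypothetical, and it
assumes that a quoted symbol is non-empty and does not itself
contain a single quote, a point the definition above leaves open.

```c
#include <ctype.h>
#include <string.h>

/* Returns 1 if s is a symbol under the rules stated above, else 0.
   Assumption: a quoted symbol must be non-empty and may not contain
   a single quote; the text does not say. */
int is_symbol(const char *s)
{
    size_t n = strlen(s);
    size_t i;

    if (n == 0)
        return 0;
    if (s[0] == '\'') {                  /* quoted form */
        if (n < 3 || s[n - 1] != '\'')
            return 0;
        for (i = 1; i < n - 1; i++)
            if (s[i] == '\'')
                return 0;
        return 1;
    }
    for (i = 0; i < n; i++) {            /* plain form */
        unsigned char c = (unsigned char)s[i];
        if (!isalnum(c) && c != '_' && c != '$' && c != '.')
            return 0;
    }
    return 1;
}
```

Under these assumptions idn, rhs_list, $accept and '+' are all
accepted, while a+b is rejected.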

The symbols which appear as the left-hand-sides of rules  are the
non-terminal symbols; all other symbols appearing in  the grammar
are assumed to be terminal symbols.  The symbol appearing  as the
left-hand-side of the  first rule is  considered to be  the start
symbol of the grammar.

After a file containing the grammar has been prepared, YACC may
be run.  YACC will respond by asking for the name of the file
containing the grammar.  After the file name is entered, YACC
will analyze the grammar and construct the parsing tables.  YACC
will print some messages on the terminal to indicate its
progress.  When it has finished, a listing will have been placed
on the file YACC OUTPUT and the parsing tables will have been
written onto the file YACC TABLES.

In the process  of constructing  a  parser for the  grammar, YACC
may discover conflicts in the grammar.  These  conflicts indicate
that the grammar is not LALR(1).  The conflicts, which are listed
in  the OUTPUT  file, may  be of  two types.   The first  type of
conflict  is  a   shift/reduce  conflict,  abbreviated   S/R.   A
shift/reduce conflict indicates that, in the given state and with
the given input symbol, the constructed parser could legitimately
either shift the input symbol onto the stack or make an immediate
reduction.  Shift/reduce conflicts are resolved by YACC  in favor
of  shifting.  The  second type  of conflict  is  a reduce/reduce
conflict,  abbreviated R/R.   A reduce/reduce  conflict indicates
that, in  the given state  and with the  given input  symbol, the
parser  could  legitimately   make  either  of   two  reductions.
Reduce/reduce  conflicts are  resolved by  YACC in  favor  of the
production appearing earlier in the input file.

The relation  of a conflict  to a problem  in the grammar  can be
determined by examining  the description of the  particular state
in the action table section  of the OUTPUT file.  The  first part
of the description  is a set  of items, where  an item is  a rule
which contains a marker ('.') in the right-hand-side.  The marker
indicates how much  of the right-hand-side  has been seen  by the
parser when the parser is in that state.  Thus, the collection of
items represents the set of possibilities being considered by the
parser when in that state.  A conflict indicates that  the parser
cannot  discard one  of  two possibilities  on the  basis  of the
current  input symbol,  yet  any action  it takes  will  have the
effect of eliminating one of the two possibilities.


3.  Interfacing with a Lexical Analyzer

The  parsing tables  produced by  YACC  are in  the form  of  a C
program, ready  to be compiled  by the C  compiler (CC).   This C
program may  be loaded  together with the  compiled version  of a
parsing routine in  order to construct  a working parser  for the
language.  A standard parsing routine, called PARSE, may be found
in the file "<C>YPARSE.C".

PARSE assumes the existence of a lexical routine,  called GETTOK,
which it  can call in  order to obtain  the next  terminal symbol
from the input stream.  GETTOK  is expected to set the  values of
three integer global  variables, LEXTYPE, LEXINDEX,  and LEXLINE.
LEXTYPE should  be set  to an  integer which  distinguishes which
terminal  symbol  has  been  read.   The  correspondence  between
integers  and  terminal  symbols is  listed  in  the  OUTPUT file
produced by YACC.  However, it is more convenient when  an actual
parser  is  to  be  constructed to  specify  in  the  grammar the
correspondence between  integers and  terminal symbols.   This is
done by listing at the beginning of the file the terminal symbols
of the  grammar.  They will  be numbered  consecutively, starting
with 3.  (The integer 1 is to be returned by the  lexical routine
to  indicate  the end  of  the  input stream;  the  integer  2 is
reserved for an error  recovery method.) The listing  of terminal
symbols in the grammar should be separated from the list of rules
by the symbol '\\'.  For example, the grammar

         '+' '-' '*' '/' '(' ')' idn

         \\

         e:        e '+' t  |  e '-' t  |  t
         t:        t '*' p  |  t '/' p  |  p
         p:        idn  |  '(' e ')'

defines the following representations of terminal symbols:

         eof       1
         +         3
         -         4
         *         5
         /         6
         (         7
         )         8
         idn       9

The variable  LEXLINE should  be set  to the  line number  in the
input file on which the terminal symbol being  returned appeared;
this value is used by  PARSE when reporting syntax errors  and is
made available to any action routines.  The variable  LEXINDEX is
used only when performing translations (see next section).

In addition,  PARSE requires  a routine  PTOKEN which  will print
some symbolic  representation of  a token;  this routine  is used
when reporting syntax errors.
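
A lexical routine for the expression grammar above might look
like the following.  It is a minimal sketch, not the routine used
with YPARSE.C: the lower-case spellings gettok, lextype, lexindex
and lexline, the argument-less form of gettok, the helper lexinit,
and the practice of reading from a string are all assumptions made
for illustration.  The token numbers are those of the table above.

```c
#include <ctype.h>

/* Token numbers from the table above. */
enum { T_EOF = 1, T_PLUS = 3, T_MINUS, T_STAR, T_SLASH,
       T_LPAREN, T_RPAREN, T_IDN };

int lextype;    /* which terminal symbol was read              */
int lexindex;   /* translation element (see next section)      */
int lexline;    /* line on which the terminal symbol appeared  */

static const char *src = "";   /* hypothetical input buffer       */
static int curline = 1;        /* current line in the input       */
static int nidn = 0;           /* hypothetical identifier counter */

/* Hypothetical helper: start reading from the given text. */
void lexinit(const char *text) { src = text; curline = 1; nidn = 0; }

void gettok(void)
{
    for (;;) {
        int c = (unsigned char)*src;

        lexline = curline;
        lexindex = 0;
        if (c == '\0') { lextype = T_EOF; return; }
        src++;
        if (c == '\n') { curline++; continue; }
        if (isalpha(c)) {
            while (isalnum((unsigned char)*src))
                src++;
            lextype = T_IDN;
            lexindex = nidn++;   /* stand-in for a symbol table index */
            return;
        }
        switch (c) {
        case '+': lextype = T_PLUS;   return;
        case '-': lextype = T_MINUS;  return;
        case '*': lextype = T_STAR;   return;
        case '/': lextype = T_SLASH;  return;
        case '(': lextype = T_LPAREN; return;
        case ')': lextype = T_RPAREN; return;
        }
        /* blanks and unrecognized characters are skipped
           (a simplification made for this sketch) */
    }
}
```

A real lexical routine would enter identifiers in a symbol table
and report illegal characters; a PTOKEN routine, not shown, would
map the same integers back to printable names.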


4.  Performing Translations

As described so far,  the parser performs only  recognition; that
is, given an  input string of  terminal symbols, it  will produce
error messages if  the string is not  in the language  defined by
the grammar and do nothing otherwise.  YACC is also capable of
producing tables  for a parser  which performs  translations, for
example,  the  syntax  analyzer  of  a  compiler.   The following
extension is  made in order  to support translation:   the parser
associates with each  terminal symbol (received from  the lexical
routine) and each nonterminal symbol (resulting from a reduction)
a  word (integer,  pointer)  called a  translation  element.  The
translation  element for  a terminal  symbol is  produced  by the
lexical  routine;  it is  communicated  to PARSE  via  the global
variable  LEXINDEX.   Typically, the  translation  element  for a
terminal  symbol  is   used  to  distinguish   between  different
identifiers  and  constants.   The  translation  element   for  a
nonterminal symbol is obtained by calling a  user-provided action
routine when a reduction  is made which produces  the nonterminal
symbol.   This  action  routine  is  specified  by  following the
production  rule in  the grammar  with the  body of  the routine,
enclosed  in   braces.   The  action   routine  may   access  the
translation elements  associated with the  symbols on  the right-
hand-side of the production  using the notation #n, where  "n" is
the  number of  the symbol  (i.e., #1  refers to  the translation
element for the first symbol of the right-hand-side).  The action
routine specifies the value for the left-hand-side by setting the
global variable VAL.  A typical action routine in a  parser which
produces tree representations is

         {val=node(node_type,#1,#2,#3);}

where node is  a routine which constructs  nodes of the  tree and
node_type is  a tag  which indicates  the type  of the  node.  An
action routine  may also specify  a line-number to  be associated
with the left-hand-side by setting the global variable  LINE; the
line-numbers of the symbols on the right-hand-side are accessible
through the global variable  PL (i.e., pl[3] refers to  the line-
number of the third symbol on the right-hand-side).
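
The following sketch suggests what a node routine of this kind
might look like when translation elements are integers.  The node
table, the tag N_ADD, and the routine act_add are hypothetical;
in particular act_add is a hand-written stand-in for an action
such as {val=node(N_ADD,#1,#3,0);} on the rule e: e '+' t, with
tr[n] playing the role of #n.  How YACC itself expands #n is not
described in this paper.

```c
#define MAXNODE 256

/* One tree node: a type tag and up to three operands, each a
   translation element (here, an index into the node table). */
struct tnode { int type, op[3]; };

static struct tnode nodes[MAXNODE];
static int nnodes = 1;      /* index 0 is kept free to mean "no node" */

int val;                    /* value for the left-hand-side */

enum { N_ADD = 1 };         /* hypothetical node-type tag */

/* Builds a node and returns its index, or 0 if the table is full. */
int node(int type, int a, int b, int c)
{
    if (nnodes >= MAXNODE)
        return 0;
    nodes[nnodes].type = type;
    nodes[nnodes].op[0] = a;
    nodes[nnodes].op[1] = b;
    nodes[nnodes].op[2] = c;
    return nnodes++;
}

/* Stand-in for the action attached to  e: e '+' t.
   tr[1] and tr[3] are the translation elements of the two
   operands; tr[2] belongs to '+' and is not used. */
void act_add(const int tr[])
{
    val = node(N_ADD, tr[1], tr[3], 0);
}
```

Returning an index rather than a pointer keeps the translation
element a single integer word; a pointer-based node routine would
serve equally well where pointers fit in a word.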


5.  Disambiguation

YACC is capable of disambiguating ambiguous grammars  through the
use  of  precedence  and  associativity  information.    This  is
especially useful in the case of arithmetic expressions  since it
allows  a much  simpler  grammar to  be used.   For  example, the
grammar for expressions given above could be written:

         '+' '-' '*' '/' '(' ')' idn

         \<  '+' '-'
         \<  '*' '/'

         \\

         e:          e '+' e
                   | e '-' e
                   | e '*' e
                   | e '/' e
                   | idn
                   | '(' e ')'

The two lines following  the list of terminal symbols  create two
levels of precedence in increasing order and assign  those levels
to the terminal symbols appearing on those lines.  The '\<' which
begins a  new precedence  level also  indicates left-association.
One  may  also  specify  '\>'  for  right-association   and  '\2'
indicating that association is  not permitted (is to  be regarded
as a syntax  error).  This last feature  may be used  to prohibit
the  misleading  association  of  operators  such  as  comparison
operators.
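
This paper does not state how the precedence levels are applied
when a conflict arises.  The sketch below assumes the convention
common to yacc-style generators: a rule takes the level of its
operator, a higher level wins, and ties are broken by the
associativity.  The names resolve, assoc and action are
hypothetical; LEFT, RIGHT and NONASSOC correspond to '\<', '\>'
and '\2'.

```c
enum assoc  { LEFT, RIGHT, NONASSOC };
enum action { ACT_SHIFT, ACT_REDUCE, ACT_ERROR };

/* Decides a shift/reduce conflict between a completed rule whose
   operator has level rule_prec and an input operator with level
   tok_prec and associativity tok_assoc.  Assumption: a higher
   level binds more tightly. */
enum action resolve(int rule_prec, int tok_prec, enum assoc tok_assoc)
{
    if (rule_prec > tok_prec)
        return ACT_REDUCE;       /* e.g. e '*' e  followed by '+' */
    if (rule_prec < tok_prec)
        return ACT_SHIFT;        /* e.g. e '+' e  followed by '*' */
    switch (tok_assoc) {         /* same level: use associativity */
    case LEFT:  return ACT_REDUCE;
    case RIGHT: return ACT_SHIFT;
    default:    return ACT_ERROR;
    }
}
```

With '+' and '-' at level 1 and '*' and '/' at level 2, this
yields the usual reading of a + b * c as a + (b * c) and of
a - b - c as (a - b) - c.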


6.  The Operation of YACC

The operation  of YACC  is performed in  five steps.   First, the
input file is read and an internal representation of  the grammar
is created.  Second, certain auxiliary data structures are
constructed that contain information about the grammar needed by
later steps.  Third, the canonical LR(0) parser for the
grammar is constructed.  Fourth, the LR(0) parser is  analyzed by
computing and applying lookahead in order to resolve conflicts in
the LR(0) parser.  Finally, a listing is written onto  the OUTPUT
file  containing  the  remaining  conflicts  in  the  parser, the
grammar, and the parser  itself, and the tables are  written onto
the TABLES file.

6.1  Constructing the Canonical LR(0) Parser

The canonical LR(0) parser for the grammar is constructed  by the
following method:   First, the grammar  is augmented by  adding a
production

         $accept:  S -|

where the symbol $accept is a distinguished nonterminal  added by
YACC, S represents the  starting symbol of the  original grammar,
and -|  represents the end-of-file  symbol.  Second,  the initial
state of the parser is created containing the item

         $accept  ->  . S -|

and its closure.  The closure of  a set of items I is  defined to
be  the smallest  set of  items  C containing  I such  that  if C
contains an item of the form

         A  ->  a . B b

for some nonterminal B and  strings a and b, then C  contains all
items of the form

         B  ->  . w

for every string w such that B: w is a rule of the grammar.  The
final step in constructing the canonical LR(0) parser consists of
constructing the set of states accessible from the initial state.
The set of accessible states is defined to be the smallest set of
states Q containing the initial state such that for each state i
in Q, if j is the successor state of i on some symbol x, then j
is in Q.  The successor state j of a state i on
a symbol x is constructed in two steps:  First, for each  item in
state i of the form

         A  ->  a . x b

for nonterminal A and strings a and b, the item

         A  ->  a x . b

is added to state j.  Second, the closure of the set of  items in
state j is added to state j.
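
The closure and successor constructions can be sketched in C for
the augmented expression grammar.  The data layout (fixed-size
arrays, an item as a rule number plus a marker position) and all
names are assumptions chosen for brevity; they are not taken from
the YACC source.

```c
/* Symbols of the augmented expression grammar.  Nonterminals come
   first, so x < NNONTERM tests for a nonterminal. */
enum { ACCEPT, E, T, P, NNONTERM,
       PLUS = NNONTERM, MINUS, STAR, SLASH, LPAR, RPAR, IDN, END };

#define MAXRHS  3
#define MAXITEM 64

struct rule  { int lhs, len, rhs[MAXRHS]; };
struct item  { int rule, dot; };         /* dot = symbols already seen */
struct state { int n; struct item it[MAXITEM]; };

static const struct rule rules[] = {
    { ACCEPT, 2, { E, END } },           /* $accept: e -| */
    { E, 3, { E, PLUS,  T } }, { E, 3, { E, MINUS, T } }, { E, 1, { T } },
    { T, 3, { T, STAR,  P } }, { T, 3, { T, SLASH, P } }, { T, 1, { P } },
    { P, 1, { IDN } },         { P, 3, { LPAR, E, RPAR } },
};
#define NRULES ((int)(sizeof rules / sizeof rules[0]))

/* Adds an item unless it is already present. */
static void add_item(struct state *s, int rule, int dot)
{
    int i;
    for (i = 0; i < s->n; i++)
        if (s->it[i].rule == rule && s->it[i].dot == dot)
            return;
    if (s->n < MAXITEM) {
        s->it[s->n].rule = rule;
        s->it[s->n].dot = dot;
        s->n++;
    }
}

/* Symbol following the marker, or -1 if the marker is at the end. */
static int next_sym(struct item it)
{
    const struct rule *r = &rules[it.rule];
    return it.dot < r->len ? r->rhs[it.dot] : -1;
}

/* Closure: for each item A -> a . B b, add every item B -> . w. */
void closure(struct state *s)
{
    int i, r;
    for (i = 0; i < s->n; i++) {         /* s->n grows as items are added */
        int b = next_sym(s->it[i]);
        if (b < 0 || b >= NNONTERM)
            continue;                    /* end of rule, or a terminal */
        for (r = 0; r < NRULES; r++)
            if (rules[r].lhs == b)
                add_item(s, r, 0);
    }
}

/* Successor of state i on symbol x: move the marker over x in
   every item that allows it, then take the closure. */
void successor(const struct state *i, int x, struct state *j)
{
    int k;
    j->n = 0;
    for (k = 0; k < i->n; k++)
        if (next_sym(i->it[k]) == x)
            add_item(j, i->it[k].rule, i->it[k].dot + 1);
    closure(j);
}
```

Starting from a state holding only the item for rule 0 with the
marker at position 0, closure produces the nine items of the
initial state; the set of accessible states is then obtained by
applying successor repeatedly until no new state appears.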

6.2  Applying Lookahead to the LR(0) Parser

The constructed  LR(0) parser  will generally  contain conflicts,
that is, states in which  more than one action is valid  for some
input symbol.  An item of the form

         A -> a .

is called a reduce  item (reduction) since it indicates  that the
entire right-hand-side of a  rule has been recognized and  can be
reduced to the left-hand-side.  An item of the form

         A -> a . x b

where x  is a terminal  symbol, is called  a shift item  since it
indicates that if x is  the current input symbol, then  it should
be shifted onto the  stack and control passed to  the x-successor
state, which will contain the item

         A -> a x . b

If a state in the LR(0) parser contains a reduce item and  one or
more shift items,  or more than one  reduce item, then  the state
contains a conflict.  Such conflicts may be resolved if it can be
determined that the reductions  are valid only for  certain input
symbols.   In  any state,  if  the sets  of  valid  input symbols
("lookahead sets")  for each  reduction and  the set  of terminal
symbols for which successor states exist are disjoint, then there
is no conflict in that  state, since the parser can  determine by
looking  at  the current  input  symbol whether  to  shift  or to
reduce, and what reduction to make.

In YACC, the lookahead sets are computed one terminal symbol at a
time; that is, for  each terminal symbol, it is  determined which
reductions are applicable (contain that terminal symbol  in their
lookahead set).  Then, each state of the LR(0) parser  is checked
for conflicts on that terminal symbol.  If there is more than one
applicable reduction, then a reduce/reduce conflict is
announced.  If there is a successor state on that terminal symbol
and  one  or  more  applicable  reductions,  then  a shift/reduce
conflict is announced.
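
The conflict test can be sketched as follows.  The sketch checks
a single state against each terminal symbol in turn, whereas YACC
as described above handles one terminal symbol across all states;
the test applied to each state and symbol is the same.  The
structure lstate, the bit-mask representation of lookahead sets,
and the routine count_conflicts are assumptions made for
illustration.

```c
#define MAXRED 8

/* Lookahead information for one state.  Terminal symbol t is
   represented by bit t of a mask, so this sketch handles at most
   32 terminal symbols. */
struct lstate {
    unsigned long shift;          /* terminals with a successor state */
    int nred;                     /* number of reduce items           */
    unsigned long look[MAXRED];   /* lookahead set of each reduction  */
};

/* Examines one state, one terminal symbol at a time, and counts
   the shift/reduce (*sr) and reduce/reduce (*rr) conflicts. */
void count_conflicts(const struct lstate *s, int nterm, int *sr, int *rr)
{
    int t, r;

    *sr = *rr = 0;
    for (t = 0; t < nterm; t++) {
        unsigned long bit = 1UL << t;
        int applicable = 0;       /* reductions valid on terminal t */

        for (r = 0; r < s->nred; r++)
            if (s->look[r] & bit)
                applicable++;
        if (applicable > 1)
            (*rr)++;
        if (applicable >= 1 && (s->shift & bit))
            (*sr)++;
    }
}
```

For example, in the ambiguous expression grammar of section 5 the
state containing the completed item for e '+' e has successors on
the four operators, and the operators are also in the lookahead
set of the reduction, so four shift/reduce conflicts are counted
before precedence information is applied.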