YACC - parser generator.

Binary file originally from ES; TS YACC, timestamped 1978-05-21. The help file is from a TOPS-20 machine; timestamp 1981-08-25.
2026-04-25 11:51:38 +00:00 · 2019-01-28 08:51:49 +01:00
parent 05d4e40333
commit 08f25f90db
3 changed files with 342 additions and 0 deletions
--- a/bin/c/ts.yacc
+++ b/bin/c/ts.yacc
--- a/doc/c/oyacc.hlp
+++ b/doc/c/oyacc.hlp
@@ -0,0 +1,341 @@
                    YACC - Use and Operation
                              - 2 -
 1.  Introduction
 This paper describes the use  and operation of an LALR(1) parser
 generator, YACC (Yet Another Compiler-Compiler).  YACC accepts as
 input a BNF-like grammar  and, if possible, produces as  output a
 set of  tables for a  table-driven shift-reduce  parsing routine.
 The parsing routine and  the tables together form a  parser which
 recognizes  the  language  defined by  the  grammar.   The parser
 generated by YACC can be used as the core of a syntax analyzer by
 including in the grammar calls to user-provided  action routines.
 These calls are made by  the parser at the appropriate  points in
 the analysis of the input string.
 The class of LALR(1) grammars is a subclass of the class of LR(1)
 grammars, those which can be parsed by a  deterministic bottom-up
 parser using one symbol  of lookahead.  The LALR(1)  grammars are
 those LR(1) grammars for which  a parser can be constructed  by a
 relatively efficient  process.  Theoretically,  all deterministic
 context-free languages have an LR(1) grammar, but not necessarily
 an LALR(1) grammar.  Practically, however,  it has been  observed
 that most  common programming  languages have  "natural" grammars
 which are easily converted to be LALR(1).
 The original YACC was designed and implemented on a PDP-11/45 and
 a  Honeywell 6000  by S.  C. Johnson  at Bell  Laboratories.  The
 version described in this paper was implemented on the  PDP-10 by
 Alan Snyder.
                              - 3 -
 2.  Using YACC
 In the simplest  case, the input to  YACC is a file  containing a
 BNF-like grammar  for the  language.  The  grammar consists  of a
 sequence of rules, which have the following syntax:
         rule:               lhs ':' rhs_list
         lhs:                symbol
         rhs_list:           rhs | rhs_list '|' rhs
         rhs:                symbol_sequence
         symbol_sequence:    symbol | symbol_sequence symbol
 The above rules for rules are examples of rules.  Another example
 is the following simple grammar for expressions:
         e:        e '+' t  |  e '-' t  |  t
         t:        t '*' p  |  t '/' p  |  p
         p:        idn  |  '(' e ')'
 A symbol  is any sequence  of alphanumeric  characters, including
 underlines, dollar signs, and periods.  In addition, a symbol may
 be any sequence of characters enclosed in single quotes.
 The symbols which appear as the left-hand-sides of rules  are the
 non-terminal symbols; all other symbols appearing in  the grammar
 are assumed to be terminal symbols.  The symbol appearing  as the
 left-hand-side of the  first rule is  considered to be  the start
 symbol of the grammar.
 After a file containing  the grammar  has been prepared,  YACC may  be
 run.  YACC will respond by  asking for the name of the file containing
 the grammar.  After  the file  name is  entered,   YACC will   analyze
 the grammar  and  construct the parsing tables.  YACC will print  some
 messages on  the  terminal to  indicate  its progress.   When  it  has
 finished, a listing will have been placed on the file  YACC OUTPUT and
 the parsing  tables will have been written onto the file YACC TABLES.
 In the process  of constructing  a  parser for the  grammar, YACC
 may discover conflicts in the grammar.  These  conflicts indicate
 that the grammar is not LALR(1).  The conflicts, which are listed
 in  the OUTPUT  file, may  be of  two types.   The first  type of
 conflict  is  a   shift/reduce  conflict,  abbreviated   S/R.   A
 shift/reduce conflict indicates that, in the given state and with
 the given input symbol, the constructed parser could legitimately
 either shift the input symbol onto the stack or make an immediate
 reduction.  Shift/reduce conflicts are resolved by YACC  in favor
 of  shifting.  The  second type  of conflict  is  a reduce/reduce
 conflict,  abbreviated R/R.   A reduce/reduce  conflict indicates
 that, in  the given state  and with the  given input  symbol, the
 parser  could  legitimately   make  either  of   two  reductions.
 Reduce/reduce  conflicts are  resolved by  YACC in  favor  of the
 production appearing earlier in the input file.
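
 The two default resolution rules above can be sketched in C.  This
 is only an illustration: the type and function names (struct
 action, resolve) are hypothetical and are not taken from the YACC
 sources.

```c
#include <assert.h>

/* Hypothetical action kinds; not taken from the YACC sources. */
enum kind { SHIFT, REDUCE };

struct action {
    enum kind kind;
    int target;   /* SHIFT: successor state; REDUCE: production number,
                     counted in order of appearance in the input file */
};

/* Choose between two conflicting actions using the defaults described
   above: a shift beats a reduction, and between two reductions the
   production appearing earlier in the input file wins. */
struct action resolve(struct action a, struct action b)
{
    if (a.kind == SHIFT && b.kind == REDUCE) return a;   /* S/R: shift */
    if (a.kind == REDUCE && b.kind == SHIFT) return b;   /* S/R: shift */
    /* A state has at most one shift per input symbol, so the only
       remaining case is a pair of reductions. */
    assert(a.kind == REDUCE && b.kind == REDUCE);
    return a.target <= b.target ? a : b;                 /* R/R: earlier */
}
```

 For example, resolving a shift to state 7 against a reduction by
 production 2 yields the shift, and resolving reductions by
 productions 5 and 2 yields production 2.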
 The relation  of a conflict  to a problem  in the grammar  can be
 determined by examining  the description of the  particular state
 in the action table section  of the OUTPUT file.  The  first part
 of the description  is a set  of items, where  an item is  a rule
 which contains a marker ('.') in the right-hand-side.  The marker
 indicates how much  of the right-hand-side  has been seen  by the
 parser when the parser is in that state.  Thus, the collection of
 items represents the set of possibilities being considered by the
 parser when in that state.  A conflict indicates that  the parser
 cannot  discard one  of  two possibilities  on the  basis  of the
 current  input symbol,  yet  any action  it takes  will  have the
 effect of eliminating one of the two possibilities.
                              - 5 -
 3.  Interfacing with a Lexical Analyzer
 The  parsing tables  produced by  YACC  are in  the form  of  a C
 program, ready  to be compiled  by the C  compiler (CC).   This C
 program may  be loaded  together with the  compiled version  of a
 parsing routine in  order to construct  a working parser  for the
 language.  A standard parsing routine, called PARSE, may be found
 in the file "<C>YPARSE.C".
 PARSE assumes the existence of a lexical routine,  called GETTOK,
 which it  can call in  order to obtain  the next  terminal symbol
 from the input stream.  GETTOK  is expected to set the  values of
 three integer global  variables, LEXTYPE, LEXINDEX,  and LEXLINE.
 LEXTYPE should  be set  to an  integer which  distinguishes which
 terminal  symbol  has  been  read.   The  correspondence  between
 integers  and  terminal  symbols is  listed  in  the  OUTPUT file
 produced by YACC.  However, it is more convenient when  an actual
 parser  is  to  be  constructed to  specify  in  the  grammar the
 correspondence between  integers and  terminal symbols.   This is
 done by listing at the beginning of the file the terminal symbols
 of the  grammar.  They will  be numbered  consecutively, starting
 with 3.  (The integer 1 is to be returned by the  lexical routine
 to  indicate  the end  of  the  input stream;  the  integer  2 is
 reserved for an error  recovery method.) The listing  of terminal
 symbols in the grammar should be separated from the list of rules
 by the symbol '\\'.  For example, the grammar
         '+' '-' '*' '/' '(' ')' idn
         \\
         e:        e '+' t  |  e '-' t  |  t
         t:        t '*' p  |  t '/' p  |  p
         p:        idn  |  '(' e ')'
 defines the following representations of terminal symbols:
         eof       1
         +         3
         -         4
         *         5
         /         6
         (         7
         )         8
         idn       9
 The variable  LEXLINE should  be set  to the  line number  in the
 input file on which the terminal symbol being  returned appeared;
 this value is used by  PARSE when reporting syntax errors  and is
 made available to any action routines.  The variable  LEXINDEX is
 used only when performing translations (see next section).
 In addition,  PARSE requires  a routine  PTOKEN which  will print
 some symbolic  representation of  a token;  this routine  is used
 when reporting syntax errors.
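
 A minimal sketch of a lexical routine for the example grammar
 above follows.  The lowercase spellings (gettok, lextype, lexindex,
 lexline, ptoken), the argument list of ptoken, the helper setinput,
 and the use of the first letter of an identifier as its
 translation element are all assumptions made for illustration; the
 token numbers are those listed in the table above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Token numbers from the table above; 1 marks the end of the input. */
enum { T_EOF = 1, T_PLUS = 3, T_MINUS, T_STAR, T_SLASH,
       T_LPAR, T_RPAR, T_IDN };

/* Globals read by the parsing routine (lowercase spelling assumed). */
int lextype, lexindex, lexline = 1;

static const char *src = "";          /* hypothetical input buffer */

void setinput(const char *text) { src = text; lexline = 1; }

/* Read the next terminal symbol and set lextype, lexindex, lexline. */
void gettok(void)
{
    static const char ops[] = "+-*/()";   /* same order as the table */
    const char *p;

    lexindex = 0;
    for (;;) {
        if (*src == '\0') { lextype = T_EOF; return; }
        if (*src == '\n') lexline++;
        if (isspace((unsigned char)*src)) { src++; continue; }
        if (isalpha((unsigned char)*src)) {
            lextype = T_IDN;
            lexindex = *src;          /* toy translation element */
            while (isalnum((unsigned char)*src)) src++;
            return;
        }
        if ((p = strchr(ops, *src)) != NULL) {
            lextype = T_PLUS + (int)(p - ops);
            src++;
            return;
        }
        fprintf(stderr, "line %d: character '%c' skipped\n", lexline, *src);
        src++;
    }
}

/* Print a symbolic form of a token; the argument list is an assumption. */
void ptoken(int type, int index)
{
    static const char *names[] =
        { "?", "eof", "?", "+", "-", "*", "/", "(", ")", "idn" };
    assert(type >= 0 && type <= T_IDN);
    if (type == T_IDN) printf("idn(%c)", index);
    else printf("%s", names[type]);
}
```

 After setinput("a +\n(b)"), successive calls to gettok yield the
 token types 9, 3, 7, 9, 8 and finally 1, with lexline advancing to
 2 once the newline has been passed.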
                              - 6 -
 4.  Performing Translations
 As described so far,  the parser performs only  recognition; that
 is, given an  input string of  terminal symbols, it  will produce
 error messages if  the string is not  in the language  defined by
 the grammar and  do nothing otherwise.   YACC is capable  also of
 producing tables  for a parser  which performs  translations, for
 example,  the  syntax  analyzer  of  a  compiler.   The following
 extension is  made in order  to support translation:   the parser
 associates with each  terminal symbol (received from  the lexical
 routine) and each nonterminal symbol (resulting from a reduction)
 a  word (integer,  pointer)  called a  translation  element.  The
 translation  element for  a terminal  symbol is  produced  by the
 lexical  routine;  it is  communicated  to PARSE  via  the global
 variable  LEXINDEX.   Typically, the  translation  element  for a
 terminal  symbol  is   used  to  distinguish   between  different
 identifiers  and  constants.   The  translation  element   for  a
 nonterminal symbol is obtained by calling a  user-provided action
 routine when a reduction  is made which produces  the nonterminal
 symbol.   This  action  routine  is  specified  by  following the
 production  rule in  the grammar  with the  body of  the routine,
 enclosed  in   braces.   The  action   routine  may   access  the
 translation elements  associated with the  symbols on  the right-
 hand-side of the production  using the notation #n, where  "n" is
 the  number of  the symbol  (i.e., #1  refers to  the translation
 element for the first symbol of the right-hand-side).  The action
 routine specifies the value for the left-hand-side by setting the
 global variable VAL.  A typical action routine in a  parser which
 produces tree representations is
         {val=node(node_type,#1,#2,#3);}
 where node is  a routine which constructs  nodes of the  tree and
 node_type is  a tag  which indicates  the type  of the  node.  An
 action routine  may also specify  a line-number to  be associated
 with the left-hand-side by setting the global variable  LINE; the
 line-numbers of the symbols on the right-hand-side are accessible
 through the global variable  PL (i.e., pl[3] refers to  the line-
 number of the third symbol on the right-hand-side).
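
 One possible shape for the node routine used in the action above
 is sketched here.  Everything in it is an assumption made for
 illustration: the node pool, the tags N_ADD and N_MUL, and the
 choice of an integer index (rather than a pointer) as the
 translation element.

```c
#include <assert.h>

#define MAXNODES 256

enum { N_ADD = 1, N_MUL = 2 };        /* hypothetical node_type tags */

struct node { int type; int kid[3]; };

static struct node pool[MAXNODES];
static int nnodes = 1;                /* index 0 is kept as "no node" */

/* Build a tree node from three translation elements and return its
   index in the pool; the index fits in the single word that the
   parser keeps for each symbol. */
int node(int type, int a, int b, int c)
{
    assert(nnodes < MAXNODES);
    pool[nnodes].type = type;
    pool[nnodes].kid[0] = a;
    pool[nnodes].kid[1] = b;
    pool[nnodes].kid[2] = c;
    return nnodes++;
}
```

 With such a routine, an action like {val=node(N_ADD,#1,#2,#3);}
 attached to the rule for e '+' t would record the translation
 elements of all three right-hand-side symbols in a new node and
 pass the node's index up as the value of e.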
                              - 7 -
 5.  Disambiguation
 YACC is capable of disambiguating ambiguous grammars  through the
 use  of  precedence  and  associativity  information.    This  is
 especially useful in the case of arithmetic expressions  since it
 allows  a much  simpler  grammar to  be used.   For  example, the
 grammar for expressions given above could be written:
         '+' '-' '*' '/' '(' ')' idn
         \<  '+' '-'
         \<  '*' '/'
         \\
         e:          e '+' e
                   | e '-' e
                   | e '*' e
                   | e '/' e
                   | idn
                   | '(' e ')'
 The two lines following  the list of terminal symbols  create two
 levels of precedence in increasing order and assign  those levels
 to the terminal symbols appearing on those lines.  The '\<' which
 begins a  new precedence  level also  indicates left-association.
 One  may  also  specify  '\>'  for  right-association   and  '\2'
 indicating that association is  not permitted (is to  be regarded
 as a syntax  error).  This last feature  may be used  to prohibit
 the  misleading  association  of  operators  such  as  comparison
 operators.
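
 The text does not spell out how the precedence levels are applied
 when a shift/reduce conflict arises.  The sketch below shows the
 rule conventionally used by YACC-style generators and is an
 assumption about this implementation: the level of the rule being
 reduced (taken from its operator) is compared with the level of
 the input symbol, and ties are broken by associativity.  All names
 are hypothetical.

```c
#include <assert.h>

enum assoc  { LEFT, RIGHT, NONE };            /* '\<', '\>', '\2' */
enum choice { DO_SHIFT, DO_REDUCE, DO_ERROR };

/* rule_prec: level of the operator in the rule that could be reduced.
   tok_prec, tok_assoc: level and associativity of the input symbol.
   Larger numbers are later (higher) precedence levels. */
enum choice disambiguate(int rule_prec, int tok_prec, enum assoc tok_assoc)
{
    assert(rule_prec > 0 && tok_prec > 0);
    if (rule_prec > tok_prec) return DO_REDUCE;  /* rule binds tighter */
    if (rule_prec < tok_prec) return DO_SHIFT;   /* input binds tighter */
    switch (tok_assoc) {                         /* same level */
    case LEFT:  return DO_REDUCE;
    case RIGHT: return DO_SHIFT;
    default:    return DO_ERROR;                 /* association forbidden */
    }
}
```

 With '+' and '-' on level 1 and '*' and '/' on level 2, both
 left-associative, the parser holding e '+' e and seeing '*' would
 shift (1 < 2), while seeing '-' it would reduce (same level,
 left-association).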
                              - 8 -
 6.  The Operation of YACC
 The operation  of YACC  is performed in  five steps.   First, the
 input file is read and an internal representation of  the grammar
 is  created.   Second,  certain  auxiliary  data  structures  are
 constructed which contain information about the grammar  which is
 used by later steps.   Third, the canonical LR(0) parser  for the
 grammar is constructed.  Fourth, the LR(0) parser is  analyzed by
 computing and applying lookahead in order to resolve conflicts in
 the LR(0) parser.  Finally, a listing is written onto  the OUTPUT
 file  containing  the  remaining  conflicts  in  the  parser, the
 grammar, and the parser  itself, and the tables are  written onto
 the TABLES file.
 6.1  Constructing the Canonical LR(0) Parser
 The canonical LR(0) parser for the grammar is constructed  by the
 following method:   First, the grammar  is augmented by  adding a
 production
         $accept:  S -|
 where the symbol $accept is a distinguished nonterminal  added by
 YACC, S represents the  starting symbol of the  original grammar,
 and -|  represents the end-of-file  symbol.  Second,  the initial
 state of the parser is created containing the item
         $accept  ->  . S -|
 and its closure.  The closure of  a set of items I is  defined to
 be  the smallest  set of  items  C containing  I such  that  if C
 contains an item of the form
         A  ->  a . B b
 for some nonterminal B and  strings a and b, then C  contains all
 items of the form
         B  ->  . w
 for each rule B: w.  The final step in constructing the canonical LR(0)
 parser consists of constructing the set of states accessible from
 the initial state.  The set of accessible states is defined to be
 the smallest set of states T containing the initial state such that
 for each state i in T, if  j is the successor state of i  on some
 symbol x, then j is in T.  The successor state j of a state  i on
 a symbol x is constructed in two steps:  First, for each  item in
 state i of the form
         A  ->  a . x b
 for nonterminal A and strings a and b, the item
         A  ->  a x . b
 is added to state j.  Second, the closure of the set of  items in
 state j is added to state j.
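
 The closure and successor constructions can be sketched in C for
 the expression grammar of section 2.  The representation is an
 assumption chosen for brevity and is not that of YACC itself: each
 symbol is one character ('A' for $accept, 'i' for idn, '$' for the
 end-of-file symbol), and a set of items is a bit mask with one bit
 per (production, marker position) pair.

```c
#include <assert.h>

struct prod { char lhs; const char *rhs; };

/* Augmented expression grammar. */
static const struct prod g[] = {
    {'A', "e$"},
    {'e', "e+t"}, {'e', "e-t"}, {'e', "t"},
    {'t', "t*p"}, {'t', "t/p"}, {'t', "p"},
    {'p', "i"},   {'p', "(e)"},
};
enum { NPROD = sizeof g / sizeof g[0], MAXDOT = 4 };

typedef unsigned long long itemset;   /* bit prod*MAXDOT + dot */

static itemset item(int prod, int dot)
{
    return 1ULL << (prod * MAXDOT + dot);
}

/* Smallest superset of set such that whenever the marker stands
   before a nonterminal B, every item  B -> . w  is included. */
itemset closure(itemset set)
{
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int p = 0; p < NPROD; p++)
            for (int d = 0; d < MAXDOT; d++) {
                if (!(set & item(p, d))) continue;
                char b = g[p].rhs[d];          /* symbol after marker */
                for (int q = 0; q < NPROD; q++)
                    if (g[q].lhs == b && !(set & item(q, 0))) {
                        set |= item(q, 0);
                        changed = 1;
                    }
            }
    }
    return set;
}

/* Successor of state on symbol x: advance the marker over x in every
   item that allows it, then take the closure. */
itemset successor(itemset state, char x)
{
    itemset next = 0;
    assert(x != '\0');
    for (int p = 0; p < NPROD; p++)
        for (int d = 0; d < MAXDOT; d++)
            if ((state & item(p, d)) && g[p].rhs[d] == x)
                next |= item(p, d + 1);
    return closure(next);
}
```

 The initial state is closure(item(0, 0)).  The accessible states
 are then found by applying successor repeatedly, starting from the
 initial state, until no new non-empty item sets appear.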
 6.2  Applying Lookahead to the LR(0) Parser
 The constructed  LR(0) parser  will generally  contain conflicts,
 that is, states in which  more than one action is valid  for some
 input symbol.  An item of the form
         A -> a .
 is called a reduce  item (reduction) since it indicates  that the
 entire right-hand-side of a  rule has been recognized and  can be
 reduced to the left-hand-side.  An item of the form
         A -> a . x b
 where x  is a terminal  symbol, is called  a shift item  since it
 indicates that if x is  the current input symbol, then  it should
 be shifted onto the  stack and control passed to  the x-successor
 state, which will contain the item
         A -> a x . b
 If a state in the LR(0) parser contains a reduce item and  one or
 more shift items,  or more than one  reduce item, then  the state
 contains a conflict.  Such conflicts may be resolved if it can be
 determined that the reductions  are valid only for  certain input
 symbols.   In  any state,  if  the sets  of  valid  input symbols
 ("lookahead sets")  for each  reduction and  the set  of terminal
 symbols for which successor states exist are disjoint, then there
 is no conflict in that  state, since the parser can  determine by
 looking  at  the current  input  symbol whether  to  shift  or to
 reduce, and what reduction to make.
 In YACC, the lookahead sets are computed one terminal symbol at a
 time; that is, for  each terminal symbol, it is  determined which
 reductions are applicable (contain that terminal symbol  in their
 lookahead set).  Then, each state of the LR(0) parser  is checked
 for conflicts on  that terminal symbol.   If there  is more  than
 one  applicable  reduction,  then  a  reduce/reduce  conflict  is
 announced.  If there is a successor state on that terminal symbol
 and  one  or  more  applicable  reductions,  then  a shift/reduce
 conflict is announced.
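
 The per-terminal check described above can be sketched as follows.
 The representation (a bit mask indexed by terminal number) and all
 names are assumptions made for illustration, and the sketch takes
 the lookahead sets as given rather than computing them.

```c
#include <assert.h>

enum { NO_CONFLICT = 0, SR_CONFLICT = 1, RR_CONFLICT = 2 };

typedef unsigned tokset;   /* bit t set: terminal number t is a member */

/* Check one state against one terminal symbol t.
   shifts: terminals on which the state has a successor state.
   la[i]:  lookahead set of the state's i-th reduction (n in all).
   Returns a combination of the flags above. */
int check(tokset shifts, const tokset la[], int n, int t)
{
    int applicable = 0, result = NO_CONFLICT;

    assert(t >= 0 && t < 32);
    for (int i = 0; i < n; i++)
        if ((la[i] >> t) & 1u)
            applicable++;
    if (applicable > 1)
        result |= RR_CONFLICT;
    if (applicable >= 1 && ((shifts >> t) & 1u))
        result |= SR_CONFLICT;
    return result;
}
```

 For instance, a state with one reduction whose lookahead set is
 {eof, +, -, )} and with successors on * and / has no conflict on
 any terminal, since the two sets are disjoint.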
--- a/doc/programs.md
+++ b/doc/programs.md
@@ -341,6 +341,7 @@
 - XGP, PDP-11 controller for the Xerox Graphics Printer.
 - XGPSPL, spooler for the Xerox Graphics Printer.
 - XXFILE, feed scripted input to a STY session.
 - YACC, parser generator.
 - YAHTZE, the game of Yahtzee.
 - YOW, print Zippyisms.
 - ZAP, dump TV bitmap as an XGP scan file.