YACC - parser generator.

The binary is originally from ES; TS YACC, timestamped 1978-05-21. The help file is from a TOPS-20 machine, timestamped 1981-08-25.
2026-01-26 04:02:06 +00:00 · 2019-01-28 08:51:49 +01:00
parent 05d4e40333
commit 08f25f90db
3 changed files with 342 additions and 0 deletions
--- a/doc/c/oyacc.hlp
+++ b/doc/c/oyacc.hlp
@@ -0,0 +1,341 @@
+                    YACC - Use and Operation
+                              - 2 -
+
+
+1.  Introduction
+
+This paper describes  the use and operation of an  LALR(1) parser
+generator YACC (Yet Another Compiler-Compiler).  YACC  accepts as
+input a BNF-like grammar  and, if possible, produces as  output a
+set of  tables for a  table-driven shift-reduce  parsing routine.
+The parsing routine and  the tables together form a  parser which
+recognizes  the  language  defined by  the  grammar.   The parser
+generated by YACC can be used as the core of a syntax analyzer by
+including in the grammar calls to user-provided  action routines.
+These calls are made by  the parser at the appropriate  points in
+the analysis of the input string.
+
+The class of LALR(1) grammars is a subclass of the class of LR(1)
+grammars, those which can be parsed by a  deterministic bottom-up
+parser using one symbol  of lookahead.  The LALR(1)  grammars are
+those LR(1) grammars for which  a parser can be constructed  by a
+relatively efficient  process.  Theoretically,  all deterministic
+context-free languages have an LR(1) grammar, but not necessarily
+an LALR(1) grammar.  Practically, however,  it has  been observed
+that most  common programming  languages have  "natural" grammars
+which are easily converted to be LALR(1).
+
+The original YACC was designed and implemented on a PDP-11/45 and
+a  Honeywell 6000  by S.  C. Johnson  at Bell  Laboratories.  The
+version described in this paper was implemented on the  PDP-10 by
+Alan Snyder.
+                              - 3 -
+
+
+2.  Using YACC
+
+In the simplest  case, the input to  YACC is a file  containing a
+BNF-like grammar  for the  language.  The  grammar consists  of a
+sequence of rules, which have the following syntax:
+
+         rule:               lhs ':' rhs_list
+         lhs:                symbol
+         rhs_list:           rhs | rhs_list '|' rhs
+         rhs:                symbol_sequence
+         symbol_sequence:    symbol | symbol_sequence symbol
+ 
+The above rules for rules are examples of rules.  Another example
+is the following simple grammar for expressions:
+
+         e:        e '+' t  |  e '-' t  |  t
+         t:        t '*' p  |  t '/' p  |  p
+         p:        idn  |  '(' e ')'
+ 
+A symbol  is any sequence  of alphanumeric  characters, including
+underlines, dollar signs, and periods.  In addition, a symbol may
+be any sequence of characters enclosed in single quotes.
+
+The symbols which appear as the left-hand-sides of rules  are the
+non-terminal symbols; all other symbols appearing in  the grammar
+are assumed to be terminal symbols.  The symbol appearing  as the
+left-hand-side of the  first rule is  considered to be  the start
+symbol of the grammar.
+
+After a file containing  the grammar  has been prepared,  YACC may  be
+run.  YACC will respond by  asking for the name of the file containing
+the grammar.  After  the file  name is  entered,   YACC will   analyze
+the grammar  and  construct the parsing tables.  YACC will print  some
+messages on  the  terminal to  indicate  its progress.   When  it  has
+finished, a listing will have been placed on the file  YACC OUTPUT and
+the parsing  tables will have been written onto the file YACC TABLES.
+
+In the process  of constructing  a  parser for the  grammar, YACC
+may discover conflicts in the grammar.  These  conflicts indicate
+that the grammar is not LALR(1).  The conflicts, which are listed
+in  the OUTPUT  file, may  be of  two types.   The first  type of
+conflict  is  a   shift/reduce  conflict,  abbreviated   S/R.   A
+shift/reduce conflict indicates that, in the given state and with
+the given input symbol, the constructed parser could legitimately
+either shift the input symbol onto the stack or make an immediate
+reduction.  Shift/reduce conflicts are resolved by YACC  in favor
+of  shifting.  The  second type  of conflict  is  a reduce/reduce
+conflict,  abbreviated R/R.   A reduce/reduce  conflict indicates
+that, in  the given state  and with the  given input  symbol, the
+parser  could  legitimately   make  either  of   two  reductions.
+Reduce/reduce  conflicts are  resolved by  YACC in  favor  of the
+production appearing earlier in the input file.
+
+The relation  of a conflict  to a problem  in the grammar  can be
+                              - 4 -
+
+
+determined by examining  the description of the  particular state
+in the action table section  of the OUTPUT file.  The  first part
+of the description  is a set  of items, where  an item is  a rule
+which contains a marker ('.') in the right-hand-side.  The marker
+indicates how much  of the right-hand-side  has been seen  by the
+parser when the parser is in that state.  Thus, the collection of
+items represents the set of possibilities being considered by the
+parser when in that state.  A conflict indicates that  the parser
+cannot  discard one  of  two possibilities  on the  basis  of the
+current  input symbol,  yet  any action  it takes  will  have the
+effect of eliminating one of the two possibilities.
+                              - 5 -
+
+
+3.  Interfacing with a Lexical Analyzer
+
+The  parsing tables  produced by  YACC  are in  the form  of  a C
+program, ready  to be compiled  by the C  compiler (CC).   This C
+program may  be loaded  together with the  compiled version  of a
+parsing routine in  order to construct  a working parser  for the
+language.  A standard parsing routine, called PARSE, may be found
+in the file "<C>YPARSE.C".
+
+PARSE assumes the existence of a lexical routine,  called GETTOK,
+which it  can call in  order to obtain  the next  terminal symbol
+from the input stream.  GETTOK  is expected to set the  values of
+three integer global  variables, LEXTYPE, LEXINDEX,  and LEXLINE.
+LEXTYPE should  be set  to an  integer which  distinguishes which
+terminal  symbol  has  been  read.   The  correspondence  between
+integers  and  terminal  symbols is  listed  in  the  OUTPUT file
+produced by YACC.  However, it is more convenient when  an actual
+parser  is  to  be  constructed to  specify  in  the  grammar the
+correspondence between  integers and  terminal symbols.   This is
+done by listing at the beginning of the file the terminal symbols
+of the  grammar.  They will  be numbered  consecutively, starting
+with 3.  (The integer 1 is to be returned by the  lexical routine
+to  indicate  the end  of  the  input stream;  the  integer  2 is
+reserved for an error  recovery method.) The listing  of terminal
+symbols in the grammar should be separated from the list of rules
+by the symbol '\\'.  For example, the grammar
+
+         '+' '-' '*' '/' '(' ')' idn
+
+         \\
+
+         e:        e '+' t  |  e '-' t  |  t
+         t:        t '*' p  |  t '/' p  |  p
+         p:        idn  |  '(' e ')'
+ 
+defines the following representations of terminal symbols:
+
+         eof       1
+         +         3
+         -         4
+         *         5
+         /         6
+         (         7
+         )         8
+         idn       9
+ 
+The variable  LEXLINE should  be set  to the  line number  in the
+input file on which the terminal symbol being  returned appeared;
+this value is used by  PARSE when reporting syntax errors  and is
+made available to any action routines.  The variable  LEXINDEX is
+used only when performing translations (see next section).
+
+In addition,  PARSE requires  a routine  PTOKEN which  will print
+some symbolic  representation of  a token;  this routine  is used
+when reporting syntax errors.
+                              - 6 -
+
+
+4.  Performing Translations
+
+As described so far,  the parser performs only  recognition; that
+is, given an  input string of  terminal symbols, it  will produce
+error messages if  the string is not  in the language  defined by
+the grammar and  do nothing otherwise.   YACC is also  capable of
+producing tables  for a parser  which performs  translations, for
+example,  the  syntax  analyzer  of  a  compiler.   The following
+extension is  made in order  to support translation:   the parser
+associates with each  terminal symbol (received from  the lexical
+routine) and each nonterminal symbol (resulting from a reduction)
+a  word (integer,  pointer)  called a  translation  element.  The
+translation  element for  a terminal  symbol is  produced  by the
+lexical  routine;  it is  communicated  to PARSE  via  the global
+variable  LEXINDEX.   Typically, the  translation  element  for a
+terminal  symbol  is   used  to  distinguish   between  different
+identifiers  and  constants.   The  translation  element   for  a
+nonterminal symbol is obtained by calling a  user-provided action
+routine when a reduction  is made which produces  the nonterminal
+symbol.   This  action  routine  is  specified  by  following the
+production  rule in  the grammar  with the  body of  the routine,
+enclosed  in   braces.   The  action   routine  may   access  the
+translation elements  associated with the  symbols on  the right-
+hand-side of the production  using the notation #n, where  "n" is
+the  number of  the symbol  (i.e., #1  refers to  the translation
+element for the first symbol of the right-hand-side).  The action
+routine specifies the value for the left-hand-side by setting the
+global variable VAL.  A typical action routine in a  parser which
+produces tree representations is
+
+         {val=node(node_type,#1,#2,#3);}
+ 
+where node is  a routine which constructs  nodes of the  tree and
+node_type is  a tag  which indicates  the type  of the  node.  An
+action routine  may also specify  a line-number to  be associated
+with the left-hand-side by setting the global variable  LINE; the
+line-numbers of the symbols on the right-hand-side are accessible
+through the global variable  PL (i.e., pl[3] refers to  the line-
+number of the third symbol on the right-hand-side).
+                              - 7 -
+
+
+5.  Disambiguation
+
+YACC is capable of disambiguating ambiguous grammars  through the
+use  of  precedence  and  associativity  information.    This  is
+especially useful in the case of arithmetic expressions  since it
+allows  a much  simpler  grammar to  be used.   For  example, the
+grammar for expressions given above could be written:
+
+         '+' '-' '*' '/' '(' ')' idn
+
+         \<  '+' '-'
+         \<  '*' '/'
+
+         \\
+
+         e:          e '+' e
+                   | e '-' e
+                   | e '*' e
+                   | e '/' e
+                   | idn
+                   | '(' e ')'
+ 
+The two lines following  the list of terminal symbols  create two
+levels of precedence in increasing order and assign  those levels
+to the terminal symbols appearing on those lines.  The '\<' which
+begins a  new precedence  level also  indicates left-association.
+One  may  also  specify  '\>'  for  right-association   and  '\2'
+indicating that association is  not permitted (is to  be regarded
+as a syntax  error).  This last feature  may be used  to prohibit
+the  misleading  association  of  operators  such  as  comparison
+operators.
+                              - 8 -
+
+
+6.  The Operation of YACC
+
+The operation  of YACC  is performed in  five steps.   First, the
+input file is read and an internal representation of  the grammar
+is  created.   Second,  certain  auxiliary  data  structures  are
+constructed which contain information about the grammar  which is
+used by later steps.   Third, the canonical LR(0) parser  for the
+grammar is constructed.  Fourth, the LR(0) parser is  analyzed by
+computing and applying lookahead in order to resolve conflicts in
+the LR(0) parser.  Finally, a listing is written onto  the OUTPUT
+file  containing  the  remaining  conflicts  in  the  parser, the
+grammar, and the parser  itself, and the tables are  written onto
+the TABLES file.
+
+6.1  Constructing the Canonical LR(0) Parser
+
+The canonical LR(0) parser for the grammar is constructed  by the
+following method:   First, the grammar  is augmented by  adding a
+production
+
+         $accept:  S -|
+ 
+where the symbol $accept is a distinguished nonterminal  added by
+YACC, S represents the  starting symbol of the  original grammar,
+and -|  represents the end-of-file  symbol.  Second,  the initial
+state of the parser is created containing the item
+
+         $accept  ->  . S -|
+ 
+and its closure.  The closure of  a set of items I is  defined to
+be  the smallest  set of  items  C containing  I such  that  if C
+contains an item of the form
+
+         A  ->  a . B b
+ 
+for some nonterminal B and  strings a and b, then C  contains all
+items of the form
+
+         B  ->  . w
+ 
+for string w.  The final step in constructing the canonical LR(0)
+parser consists of constructing the set of states accessible from
+the initial state.  The set of accessible states is defined to be
+the smallest set Q of states containing the initial state such
+that for each state i in Q, if j is the successor state of i on
+some symbol x, then j is in Q.  The successor state j of a state i
+on a symbol x is constructed in two steps:  First, for each item
+in state i of the form
+
+         A  ->  a . x b
+ 
+for nonterminal A and strings a and b, the item
+
+         A  ->  a x . b
+ 
+                              - 9 -
+
+
+is added to state j.  Second, the closure of the set of  items in
+state j is added to state j.
+
+6.2  Applying Lookahead to the LR(0) Parser
+
+The constructed  LR(0) parser  will generally  contain conflicts,
+that is, states in which  more than one action is valid  for some
+input symbol.  An item of the form
+
+         A -> a .
+ 
+is called a reduce  item (reduction) since it indicates  that the
+entire right-hand-side of a  rule has been recognized and  can be
+reduced to the left-hand-side.  An item of the form
+
+         A -> a . x b
+ 
+where x  is a terminal  symbol, is called  a shift item  since it
+indicates that if x is  the current input symbol, then  it should
+be shifted onto the  stack and control passed to  the x-successor
+state, which will contain the item
+
+         A -> a x . b
+ 
+If a state in the LR(0) parser contains a reduce item and  one or
+more shift items,  or more than one  reduce item, then  the state
+contains a conflict.  Such conflicts may be resolved if it can be
+determined that the reductions  are valid only for  certain input
+symbols.   In  any state,  if  the sets  of  valid  input symbols
+("lookahead sets")  for each  reduction and  the set  of terminal
+symbols for which successor states exist are pairwise disjoint,
+then there is no conflict in that state, since the parser can
+determine by looking at the current input symbol whether to shift
+or to reduce, and what reduction to make.
+
+In YACC, the lookahead sets are computed one terminal symbol at a
+time; that is, for  each terminal symbol, it is  determined which
+reductions are applicable (contain that terminal symbol  in their
+lookahead set).  Then, each state of the LR(0) parser  is checked
+for conflicts on  that terminal symbol.   If there is  more  than
+one  applicable  reduction,  then  a  reduce/reduce  conflict  is
+announced.  If there is a successor state on that terminal symbol
+and  one  or  more  applicable  reductions,  then  a shift/reduce
+conflict is announced.
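Note on section 6.1 of doc/c/oyacc.hlp: the closure and successor constructions described there can be sketched in C roughly as follows. This is a minimal illustration written for this note, not code taken from YACC. The rule table encodes the expression grammar from section 2 plus the added $accept production, END stands for the end-of-file symbol -|, and every identifier (struct rule, struct state, add_item, closure, successor, initial_state) is hypothetical.

```c
#include <assert.h>

/* Symbols: nonterminals first, then terminals (END stands for -|). */
enum { ACCEPT, E, T, P, NNONTERM,
       PLUS = NNONTERM, MINUS, STAR, SLASH, LPAR, RPAR, IDN, END };

#define MAXRHS 3
struct rule { int lhs, len, rhs[MAXRHS]; };

/* The expression grammar of section 2, augmented with $accept. */
const struct rule rules[] = {
    { ACCEPT, 2, { E, END } },
    { E, 3, { E, PLUS, T } },
    { E, 3, { E, MINUS, T } },
    { E, 1, { T } },
    { T, 3, { T, STAR, P } },
    { T, 3, { T, SLASH, P } },
    { T, 1, { P } },
    { P, 1, { IDN } },
    { P, 3, { LPAR, E, RPAR } }
};
#define NRULES ((int)(sizeof rules / sizeof rules[0]))
#define MAXITEMS 36   /* 9 rules, at most 4 marker positions each */

/* An item is a rule plus the position of the marker ('.') in its rhs. */
struct item  { int rule, dot; };
struct state { int n; struct item items[MAXITEMS]; };

/* Add an item to a state unless it is already present. */
static void add_item(struct state *s, int rule, int dot)
{
    int i;
    for (i = 0; i < s->n; i++)
        if (s->items[i].rule == rule && s->items[i].dot == dot)
            return;
    assert(s->n < MAXITEMS);
    s->items[s->n].rule = rule;
    s->items[s->n].dot = dot;
    s->n++;
}

/* Closure: for every item A -> a . B b with B a nonterminal, add all
   items B -> . w.  s->n grows during the scan, so items added here
   are themselves examined later in the same loop. */
void closure(struct state *s)
{
    int i, r;
    for (i = 0; i < s->n; i++) {
        const struct rule *ru = &rules[s->items[i].rule];
        int dot = s->items[i].dot;
        if (dot >= ru->len || ru->rhs[dot] >= NNONTERM)
            continue;          /* reduce item, or marker before a terminal */
        for (r = 0; r < NRULES; r++)
            if (rules[r].lhs == ru->rhs[dot])
                add_item(s, r, 0);
    }
}

/* Successor of state `from` on symbol x: move the marker over x in
   every item that has x after the marker, then take the closure. */
void successor(const struct state *from, int x, struct state *to)
{
    int i;
    to->n = 0;
    for (i = 0; i < from->n; i++) {
        const struct rule *ru = &rules[from->items[i].rule];
        int dot = from->items[i].dot;
        if (dot < ru->len && ru->rhs[dot] == x)
            add_item(to, from->items[i].rule, dot + 1);
    }
    closure(to);
}

/* Initial state: the item $accept -> . S -| and its closure. */
void initial_state(struct state *s)
{
    s->n = 0;
    add_item(s, 0, 0);
    closure(s);
}
```

With this table, initial_state(&s0) produces the nine items that have the marker at the left end. successor(&s0, T, &s1) produces a state holding e -> t . together with t -> t . '*' p and t -> t . '/' p. That is a reduce item next to shift items, the kind of LR(0) conflict that section 6.2 resolves with lookahead.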
--- a/doc/programs.md
+++ b/doc/programs.md
@@ -341,6 +341,7 @@
 - XGP, PDP-11 controller for the Xerox Graphics Printer.
 - XGPSPL, spooler for the Xerox Graphics Printer.
 - XXFILE, feed scripted input to a STY session.
+- YACC, parser generator.
 - YAHTZE, the game of Yahtzee.
 - YOW, print Zippyisms.
 - ZAP, dump TV bitmap as an XGP scan file.
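Note on section 3 of doc/c/oyacc.hlp: a lexical routine for the example grammar might look roughly like the sketch below. It is an illustration only, not the routine shipped with YPARSE.C. The token numbers are the ones listed in the help file (eof = 1, '+' = 3 through idn = 9). The lowercase spelling of the globals, the string-based input set up by lex_init, the identifier table, and the choice to skip unrecognized characters are all assumptions made for this note.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Token numbers as listed in the help file for the example grammar. */
enum { T_EOF = 1, T_PLUS = 3, T_MINUS, T_STAR, T_SLASH,
       T_LPAR, T_RPAR, T_IDN };

/* Globals read by the parsing routine (lowercase names are assumed). */
int lextype, lexindex, lexline = 1;

/* Hypothetical input source and identifier table. */
#define MAXIDN 64
#define IDNLEN 32
static const char *src = "";
static char idn_tab[MAXIDN][IDNLEN];
static int idn_count;

/* Point the lexer at a NUL-terminated input string. */
void lex_init(const char *text)
{
    src = text;
    lexline = 1;
    idn_count = 0;
}

/* Return the table index of an identifier, adding it if it is new. */
static int idn_lookup(const char *s, size_t n)
{
    int i;
    assert(n < IDNLEN);
    for (i = 0; i < idn_count; i++)
        if (strlen(idn_tab[i]) == n && memcmp(idn_tab[i], s, n) == 0)
            return i;
    assert(idn_count < MAXIDN);
    memcpy(idn_tab[idn_count], s, n);
    idn_tab[idn_count][n] = '\0';
    return idn_count++;
}

/* Read one terminal symbol and set lextype, lexindex and lexline. */
void gettok(void)
{
    static const char ops[] = "+-*/()";   /* same order as T_PLUS.. */
    const char *p;

    lexindex = 0;
    for (;;) {
        if (*src == '\0') {                /* end of input: return 1 */
            lextype = T_EOF;
            return;
        }
        if (*src == '\n')                  /* count lines for messages */
            lexline++;
        if ((p = strchr(ops, *src)) != NULL) {
            lextype = T_PLUS + (int)(p - ops);
            src++;
            return;
        }
        if (isalpha((unsigned char)*src)) {
            const char *start = src;
            while (isalnum((unsigned char)*src))
                src++;
            lextype = T_IDN;
            /* the translation element tells identifiers apart */
            lexindex = idn_lookup(start, (size_t)(src - start));
            return;
        }
        src++;   /* blanks, newlines and unrecognized characters */
    }
}
```

After lex_init("a + b\n* a"), successive calls to gettok() set lextype to 9, 3, 9, 5, 9 and finally 1. The two occurrences of a share lexindex 0, b gets lexindex 1, and lexline is 2 from the '*' onward.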