No, Virginia, this file is not empty...
                      DECUS C LANGUAGE SYSTEM


                                LEX

                   A Lexical Analyser Generator


                                by

                        Charles H. Forsyth

                      University of Waterloo
                    Waterloo, Ontario, N2L 3G1
                              Canada


                            Revised by

                  Robert B. Denny & Martin Minow


     LEX transforms a regular-expression grammar and associated
     action routines into a C function and set of tables, yielding
     a table-driven lexical analyser which manages to be compact
     and rapid.


                  DECUS Structured Languages SIG

                      Version of 30-Oct-82

          Copyright (C) 1978, Charles H. Forsyth

     Modifications Copyright (C) 1980, 1982, DECUS

     General permission to copy or modify, but not for profit, is
     hereby granted, provided that the above copyright notice is
     included and reference made to the fact that reproduction
     privileges were granted by DECUS.

     The information in this document is subject to change without
     notice and should not be construed as a commitment by Digital
     Equipment Corporation or by DECUS.

     Neither Digital Equipment Corporation, DECUS, nor the authors
     assume any responsibility for the use or reliability of this
     document or the described software.

     This software is made available without any support
     whatsoever.  The person responsible for an implementation of
     this system should expect to have to understand and modify the
     source code if any problems are encountered in implementing or
     maintaining the compiler or its run-time library.  The DECUS
     'Structured Languages Special Interest Group' is the primary
     focus for communication among users of this software.

     UNIX is a trademark of Bell Telephone Laboratories.  RSX,
     RSTS/E, RT-11 and VMS are trademarks of Digital Equipment
     Corporation.

1.0 Introduction

A computer program often has an input stream which is composed of
small elements, such as a stream of characters, and which it would
like to convert to larger elements in order to process the data
conveniently.  A compiler is a common example of such a program: it
reads a stream of characters forming a program, and would like to
turn this sequence of characters into a sequence of larger items,
namely identifiers, numbers, and operators, for parsing.  In a
compiler, the procedures which do this are collectively called the
lexical analyser, or scanner; this terminology will be used more
generally here.

It may happen that the speed with which this transformation is done
will noticeably affect the speed at which the rest of the program
operates.  It is certainly true that although such code is rarely
difficult to write, writing and debugging it is both tedious and
time-consuming, and one typically would rather spend the time on the
hard parts of the program.  It is also true that while certain
transformations are easily thought of, they are often hard to
express succinctly in the usual general-purpose programming
languages (eg, the description of a floating-point number).

LEX is a program which tries to give a programmer a good deal of
help in this, by writing large parts of the lexical analyser
automatically, based on a description supplied by the programmer of
the items to be recognised (which are known as tokens), as patterns
of the more basic elements of the input stream.  The LEX description
is very much a special-purpose language for writing lexical
analysers, and LEX is then simply a translator for this language.
The LEX language is easy to write, and the resulting processor is
compact and fast running.

The purpose of a LEX program is to read an input stream, and
recognise tokens.  As the lexical analyser will usually exist as a
subroutine in a larger set of programs, it will return a "token
number", which indicates which token was found, and possibly a
"token value", which provides more detailed information about the
token (eg, a copy of the token itself, or an index into a symbol
table).  This need not be the only possibility; a LEX program is
often a good description of the structure of the whole computation,
and in such a case, the lexical analyser might choose to call other
routines to perform the necessary actions whenever a particular
token is recognised, without reference to its own caller.

2.0 The Lex Language

LEX transforms a regular-expression grammar into a deterministic
finite-state automaton that recognizes that grammar.  Each rule of
the grammar is associated with an action which is to be performed
when that rule successfully matches some part of the input data.

Because of the nature of regular expression grammars, certain
language constructions cannot be recognized by LEX programs.
Specifically, expressions with balanced parentheses cannot be
recognized.  This means that LEX cannot be used to recognize all
Fortran keywords, as some (DO, IF, and FORMAT, for example) require
elaborate recognition to distinguish between ambiguous
constructions.


2.1 Elementary Things

Strings, characters, sets of characters called character classes,
and operators to form these into patterns are the fundamental
elements of the LEX language.

A string is a sequence of characters, not including newline,
enclosed in quotes or apostrophes.  Within a string, the following
escape sequences (which are those of the C language) allow any 8-bit
character to be represented, including the escape character, quotes,
and newlines:

        \n      NL  (012)
        \r      CR  (015)
        \b      BS  (010)
        \t      TAB (011)
        \"      "
        \'      '
        \c      c
        \\      \
        \NNN    (NNN)

where NNN is a number in octal, and c is any printable character.
A string may be continued across a line by writing the escape
character before the newline.

Outside a string, a sequence of upper-case letters stands for a
sequence of the equivalent lower-case letters, while a sequence of
lower-case letters is taken as the name of a LEX expression, and
handled specially, as described below.  These conventions make the
use of named expressions and the description of lower-case keywords
(the usual case on Unix) fairly convenient.  Keywords in
case-independent languages, such as Fortran, require additional
effort to match, as will be noted.

An Ascii character other than one of

        ( ) { } [ ] * | = ; % / \ ' " -

may be used in LEX to stand for itself.

A sequence of characters enclosed by brackets ('[' and ']') forms a
character class, which stands for all those characters within the
brackets.  If a circumflex ('^') follows the opening bracket, then
the class will instead stand for all characters except those inside
the brackets.  The escapes used in strings may be used in character
classes as well.

Within a character class, the construction "A-B" (where "A" and "B"
are arbitrary characters) stands for the range of characters between
"A" and "B" inclusive.

For example,

        "ABC"           matches "abc"

        "[ABC]"         matches "A" or "B" or "C"

        "[A-Za-z0-9]"   matches all letters and digits

Case-independent keyword recognition may be described by using
auxiliary definitions to define expressions that match either case.
For example,

        a = [Aa];
        b = [Bb];
        ...
        z = [Zz];
        %%
        d o             Matches "DO", "do", "Do", or "dO"

2.2 Putting Things Together

Several operators are provided to allow construction of a type of
pattern called a regular expression.  Such expressions can be
implemented as finite-state automata (without memory or stacks).  A
reference to an "occurrence" of a regular expression is generally
taken to mean an occurrence of any string matched by that regular
expression.  The operators are presented in order of decreasing
priority.  In all cases, operators work on either strings or
character classes, or on other regular expressions.

Any string or character class forms a regular expression which
matches whatever the string or character class stands for (as
described above).

The operator '*' applied following a regular expression forms a new
regular expression which matches an arbitrary number (ie, zero or
more) of adjacent occurrences of the first regular expression.  The
operation is often referred to as (Kleene) closure.

The operation of concatenation of two regular expressions is
expressed simply by writing the regular expressions adjacent to each
other.  The resulting regular expression matches any occurrence of
the first regular expression followed directly by an occurrence of
the second regular expression.

The operator '|', alternation, written between two regular
expressions forms a regular expression which matches an occurrence
of the first regular expression or an occurrence of the second
regular expression.

Any regular expression may be enclosed in parentheses to cause the
priority of operators to be overridden in the usual manner.

A few examples should help to make all of this clear:

        "[0-9]*"        matches a (possibly empty) sequence of
                        digits.

        "[A-Za-z_$][A-Za-z0-9_$]*"
                        matches a C identifier.

        "([A-Za-z_$]|[0-9])*"
                        matches a C identifier, or a sequence of
                        digits, or a sequence of letters and digits
                        intermixed, or nothing.

2.3 The General Form of Lex Programs

A LEX source input file consists of three sections: a section
containing auxiliary definitions, a section containing translations,
and a section containing programs.

Throughout a LEX program, spaces, tabs, and newlines may be used
freely, and PL/1-style comments:

        /* ... anything but '*/' ... */

may be used, and are treated as a space.

The auxiliary definition section must be present, separated from
following sections by the two-character sequence '%%', but may be
empty.  This section allows definition of named regular expressions,
which provide the useful ability to use names of regular expressions
in the translation section, in place of
common sub-expressions, or to make that section more readable.

The translation section follows the '%%' sequence, and contains
regular expressions paired with actions which describe what the
lexical analyser should do when it discovers an occurrence of a
given regular expression in its input stream.

The program section may be omitted; if it is present it must be
separated from the translation section by the '%%' sequence.  If
present, it may contain anything in general, as it is simply tacked
on to the end of the LEX output file.

The style of this layout will be familiar to users of Yacc.  As LEX
is often used with that processor, it seemed reasonable to keep to a
similar format.


2.4 Auxiliary Definitions

Given the set of regular expressions forming a complete syntax,
there are often common sub-expressions.  LEX allows these to be
named, defined but once, and referred to by name in any subsequent
regular expression.  Note that definition must precede use.  A
definition has the form:

        expression_name = regular_expression ;

where a name is composed of a lower-case letter followed by a
sequence of letters and digits, and where an underscore is
considered a letter.  For example,

        digit = [0-9];
        letter = [a-zA-Z];
        name = letter(letter|digit)*;

The semicolon is needed to resolve some ambiguities in the LEX
syntax.

Three auxiliary definitions have special meaning to LEX: "break",
"illegal", and "ignore."  They are all defined as character classes
("break = [,.?]", for example) and are used as follows:

        break           An input token will always terminate if a
                        member of the "break" class is scanned.

        illegal         The "illegal" class allows simplification of
                        error detection, as will be described in a
                        later section.  If this class is defined,
                        and the lexical analyser stops at a
                        character that "cannot" occur in its present
                        context, the analyser will output a suitable
                        error message and ignore the offender.
        ignore          This class defines a set of characters that
                        are ignored by the analyser's input routine.


2.5 Translations

One would like to provide a description of the action to be taken
when a particular sequence of input characters has been matched by a
given regular expression.  The kind of action taken might vary
considerably, depending upon the application.  In a compiler,
typical actions are: enter an identifier into a symbol table, read
and store a string, or return a particular token to the parser.  In
text processing, one might wish to reproduce most of the input
stream on an output stream unchanged, making substitutions when a
particular sequence of characters is found.  In general, it is hard
to predict what form the action might take, and so, in LEX, the
nature of the action is left to the user, by allowing specification,
for each regular expression of interest, of C-language code to be
executed when a string matching that expression is discovered by the
driving program of the lexical analyser.  An action, together with
its regular expression, is called a translation, and has the format:

        regular_expression { action }

All of this may be spread across several lines.  The action may be
empty, but the braces must appear.

Earlier, it was argued that most general-purpose languages are
inappropriate for writing lexical analysers, and it is important to
see that the subsequent use of such a language to form the actions
is not a contradiction.  Most languages are fairly good at
expressing the actions described above (symbol table manipulation,
writing character strings, and such).  Leaving this part of the
lexical analyser to those languages therefore not only makes sense,
but also ensures that decisions by the writer of the lexical
analyser generator will not unduly cramp the user's style.  However,
general-purpose languages do not as a rule provide inexpensive
pattern matching facilities, or input description formats,
appropriate for describing or structuring a lexical analyser.

Allowing a user to provide his own code is not really enough, as he
will need some help from LEX to obtain a copy of, or a pointer to,
the current token, if nothing else.  LEX provides a library of C
functions which may be called to obtain controlled access to some of
the data structures used by the driving programs of the lexical
analyser.  These are described in a later section.

2.5.1 Numbers and Values -

Typically, a lexical analyser will return a value to its caller
indicating which token has been found.  Within an action, this is
done by writing a C return statement, which returns the appropriate
value:

        BEGIN           {
                        return(T_BEGIN);
                        }
        name            {
                        lookup(token(NULL));
                        return(T_NAME);
                        }
        "/"             {
                        return('/');
                        }

Note that the function lookup() is provided by the user program.

In many cases, other information must be supplied to its caller by
the scanner.  When an identifier is recognised, for example, both a
pointer to a symbol-table entry and the token number T_NAME must be
returned, yet the C return statement can return but a single value.
Yacc has a similar problem, and so its lexical analyser sets an
external word 'yylval' to the token value, while the token number is
returned by the scanner.  LEX uses the external 'yylval' (to be
compatible), but, to make LEX programs more readable when used
alone, the name 'lexval' is set by a #define statement to 'yylval'.
For example,

        name            {
                        lexval = lookup(token(NULL));
                        return(T_NAME);
                        }

Certain token numbers are treated specially; these are automatically
defined as manifests (see section 3.2) by LEX, and all begin with
the sequence 'LEX...' so as not to clash with the user's own names.
There are two such tokens defined at present:

        LEXSKIP         When returned by a user's action routine,
                        LEXSKIP causes the lexical analyser to
                        ignore the current token (ie, it does not
                        inform the parser of its presence), and to
                        look instead for a new token.  This may be
                        used when a comment sequence has been
                        discovered, and discarded.  It is also
                        useful when the action routine completes
                        processing of the token.  See the discussion
                        of the comment() library function for an
                        example of its usage.

        LEXERR          This is returned by the lexical analyser
                        (function yylex()) when an unrecognizable
                        input
                        sequence has been detected.  By default,
                        LEXERR is 256.  This is the same as the Yacc
                        error value.

To summarise, the token number is set by the action with a return
statement, and the token value (ie, auxiliary information) is set by
assigning this value to the external integer 'lexval'.


2.6 Declaration Sections

Declarations in the language of the actions may be included in both
the auxiliary definition section and in the translation section.  In
the former case, these declarations will be external to the lexical
analyser, and in the latter case, they will be local to the lexical
analyser (ie, static, or automatic storage).  Declaration sections
consist of a sequence of declarations surrounded by the special
bracketing sequences '%{' and '%}' (as in Yacc).  The characters
within these brackets are copied unchanged into the appropriate
spots in the lexical analyser program that LEX writes.  The examples
in appendix A suggest how these might be used.


3.0 Using Lex from C

The present version of LEX is intended for use with C, and it is
this usage which will be described here.


3.1 The Function yylex()

The structure of LEX programs is influenced by what Yacc requires of
its lexical analyser.

To begin with, the lexical analyser must be named 'yylex', has no
parameters, and is expected to return a token number, where that
number is determined by Yacc.  The token number for an Ascii
character is its Ascii value (ie, its value as a C character
constant).  Named tokens, defined in Yacc '%token' statements, have
a number above 256, with the particular number accessible through a
Yacc-produced #define of the given token name as its number.  Yacc
also allows 'yylex' to pass a value to the Yacc action routines, by
assigning that value to the external 'yylval'.

LEX thus provides a lexical analyser function named 'yylex', which
interprets tables constructed by the LEX program, returning
the token number returned by the actions it performs.  Values
assigned to lexval are available in 'yylval', so that use with Yacc
is straightforward.

A value of zero is returned by 'yylex' at end-of-file, and in the
absence of a return statement in an action, a non-zero value is
returned.  If computation is performed entirely by the lexical
analyser, then a suitable main program would be

        main()
        {
                while (yylex())
                        ;
        }


3.2 Serial Re-Use of yylex()

The yylex() function contains several variables which are statically
initialized at compile time.  Once yylex() sees an EOF (-1) input
character, it will continue to return NULL.  If yylex() is to be
used inside a loop which processes multiple files, it must be
re-initialized at the beginning of each new file with a call to the
LEX library routine llinit().  For example (slightly extending the
previous example):

        main()
        {
                getfilelist();
                for (file = first; file != last; file = next)
                {
                        llinit();
                        while (yylex())
                                ;
                }
                printf("All files done\n");
        }

The call to llinit() is unnecessary if yylex() is to process only
one file, or is kept from seeing an EOF input character.


3.3 The Lex Table File

In the absence of instructions to the contrary (see below), LEX
reads a given LEX language file (from the standard input, if an
input file has not been specified) and produces a C program file
'lextab.c', which largely consists of tables which are then
interpreted by 'yylex()' (which is in the LEX library).  The actions
supplied by the user in each translation are combined with a switch
statement into a single function, which is called by the table
interpreter when a particular token is found.  The
contents of the program section of the LEX file are added at the end
of the output file (lextab.c by default).  Normally, LEX also
inserts the lines

        #include <stdio.h>
        #include <lex.h>

at the top of the file; this causes declarations required by the
standard I/O library and by LEX to be included when the C program is
compiled.


3.4 Analyzers Which Don't Use "Standard I/O"

With the current release, LEX supports the generation of analyzers
which may be incorporated into programs which do not use the
"standard I/O" library.  By setting the "-s" switch, as shown below,
the generation of the "#include <stdio.h>" line is suppressed.  All
references to standard I/O specific files and stdio.h have been
removed from the LEX library (described in a later section), with
the exception of lexgetc(), lexerror(), mapch() and lexecho(), which
are standard I/O dependent.

The declaration of yylex()'s input file iov pointer "lexin" now
resides in LEXGET.C (lexgetc()).  The code which defaults lexin to
stdin has been moved from yylex() to the table file.  yylex() now
calls the routine llstin(), which is generated into the table file.
There are no longer any hardwired references to the variable
"lextab", the default table name.  Instead, LEX generates a call to
lexswitch() in llstin(), which initializes yylex() to use the table
whose name was given in a "-t" or "-e" option on LEX's command line.
If neither was given, the default name "lextab" is used.  Once the
initial table has been set up, further automatic calls to
lexswitch() are suppressed, allowing the user to manually switch
tables as before.

In addition, if the "-s" switch is not given (ie, normal use with
standard I/O), llstin() defaults lexin to stdin.  If "-s" is given,
llstin() is generated to do only the lexswitch() mentioned above.
In any case, yylex() contains no references to the standard I/O
system.

What all of this means is that under normal operation, you won't
notice any change in LEX's characteristics.  In addition, you may
use the "-e" ("easy") switch, which will generate a C output file
and LEX tables which (conveniently) have the same name as the input
file, and everything will get set up automagically.  If you specify
the "-s" switch, the table file will contain no references to the
standard I/O package, and you may use any of the lexlib routines
except lexgetc(), lexerror(), mapch() or lexecho().

Don't forget that you must supply your own startup routine
"$$main" if you do not want the standard I/O library. With a
|
||
bit of care in this regard, it will be possible to link your
|
||
program with the C library without dragging in any I/O modules.
|
||
This prevents your having to build another library in order to
|
||
access non-I/O library functions. Just make the reference to
|
||
the C library the last one given to the linker or taskbuilder so
|
||
that only those routines which have not already been found are
|
||
pulled from CLIB.
|
||
|
||
|
||
NOTE
|
||
|
||
Programs that use LEX-generated analyzers and do not use
|
||
the standard I/O package must supply their own lexgetc()
|
||
and lexerror() routines. Failure to do so will result
|
||
in undefined globals.
|
||
|
||
|
||
|
||
|
||
|
||
3.5 Operating LEX
|
||
_________ ___
|
||
|
||
|
||
LEX normally reads the grammar from the standard input, writing
|
||
the C program to the file 'lextab.c'. It may be further
|
||
controlled by using the following flags upon invocation:
|
||
|
||
-i filename The grammar is read from 'filename'.
|
||
|
||
-o filename The analyser is written to 'filename'.
|
||
|
||
-t tablename The default finite-state automaton is
|
||
named lextab (and it is, by default,
|
||
written to file 'lextab.c'). The -t
|
||
switch causes the internal tables to be
|
||
named 'tablename' and, if the -o switch
|
||
is not given, written to file
|
||
'tablename.c'. This is necessary if the
|
||
processor-switching capabilities
|
||
described in a later section are to be
|
||
used.
|
||
|
||
-e name "Easy" command line. "-e name" is
|
||
equivalent to typing
|
||
|
||
-i name.LXI -o name.C -t name
|
||
|
||
Do not include device names or file
|
||
extensions on the "easy" command line.
|
||
|
||
-v [filename] Internal state information is written to
|
||
'filename.' If not present, state
|
||
information is written to file
|
||
'lex.out.'
|
||
        -d              Enable various debugging printouts.

        -s              Generate an analyzer without references to
                        standard I/O.

The command line for compilation of the table file should contain no
surprises:

        cc -c -O lextab.c               (on Unix)

        xcc lextab -a                   (on Dec operating systems)

but when one is producing the running program, one must be careful
to include the necessary libraries.  On Unix, the proper sequence
is:

        cc userprog.o lextab.o -ll -lS

The '-ll' causes the LEX library (described below) to be searched,
and the '-lS' causes the Standard I/O library to be used; both
libraries are required.  If Yacc is used as well, the library '-ly'
should be included before the '-ll'.  The actual order and content
of the rest of the command line is determined by the user's own
requirements.

If using the Decus C compiler, the lexical analyser built by LEX is
linked with c:lexlib.

The complete process (assuming the Decus compiler running on RSTS/E
in RT11 mode) is thus:

        mcr lex -i grammar.lxi -o grammar.c     ! Build analyser
        cc grammar                              ! Compile the
        as grammar                              !   grammar table
        link out=in,grammar,c:lexlib,c:suport,c:clib/b:2000


4.0 The Lex Library

All programs using grammars generated by LEX must be linked together
with the LEX library.  On Unix, this is '/lib/libl.a' (or '-ll' on
the cc command line) and on DEC operating systems, C:LEXLIB
(LB:[1,1]LEX.OLB for RSX).  It contains routines which are either
essential or merely useful to users of LEX.  The essential routines
include a routine to obtain a copy of the current token, and a
routine to switch to a different set of scanning tables.  Routines
of the second, useful, class perform functions which might well be
written by the user himself, but are there to save him the bother,
including a routine to process various forms of comments and a
routine to transform numbers written in arbitrary bases.  Both sets
of routines are
expected to grow as LEX sees use.

Those functions which produce diagnostics do so by calling
lexerror(), which is called as

        lexerror(string, arg1, ..., argN)

and is expected to write its arguments (likely using the "remote
format" facility of the fprintf() function), followed by a newline,
on some output stream.  A lexerror() function is included in the LEX
library, but a user is free to include his own.  The routine in the
LEX library is standard I/O specific.

                               NOTE

        The VAX/VMS native C library does not support remote
        formats.  The lexerror() function in the LEX library
        conditionally compiles to support a call to lexerror()
        with only an error message string.  Remote formats are
        supported under Decus C.  Learn to use them, they are
        very nice!

4.0.1 Comment -- skip over a comment -
|
||
_______ __ ____ ____ _ _______
|
||
|
||
comment(delim)
|
||
char delim[];
|
||
|
||
Comment() may be called by a translation when the sequence of
|
||
characters which mark the start of a comment in the given syntax
|
||
has been recognised by LEX. It takes a string which gives the
|
||
sequence of characters which mark the end of a comment, and
|
||
skips over characters in the input stream until this sequence is
|
||
found. Newlines found while skipping characters cause the
|
||
external 'yyline' to be incremented; an unexpected end-of-file
|
||
produces a suitable diagnostic. Thus, 'comment("*/")' matches
|
||
C-style comments, and 'comment("\n")' matches as-style comments.
|
||
There are other methods of handling comments in LEX; the
|
||
comment() function is usually the best with regard to both space
|
||
and time.
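The skipping loop that comment() performs can be sketched as
follows.  This is an illustration over an in-memory string, not
the LEX library source; the name skip_comment and the details of
the yyline handling are assumptions.

```c
#include <string.h>

static int yyline = 1;          /* stands in for LEX's external 'yyline' */

/* Scan 's' until the end-of-comment sequence 'delim' is found;
 * return a pointer just past it, or NULL if input runs out first.
 * Newlines seen along the way bump yyline, as comment() does. */
static const char *
skip_comment(const char *s, const char *delim)
{
        size_t n = strlen(delim);

        for (; *s != '\0'; s++) {
                if (strncmp(s, delim, n) == 0)
                        return (s + n);
                if (*s == '\n')
                        yyline++;
        }
        return (NULL);          /* unexpected end of input */
}
```

Note that the NULL return corresponds to the point where the real
comment() would issue its end-of-file diagnostic.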

4.0.2  Gettoken -- obtain a copy of token -

        gettoken(buf, sizeof(buf))
        char buf[];

Gettoken() takes the address of a character buffer, and its size
in bytes, and copies the token most recently matched by LEX into
the buffer.  A null byte is added to mark the end of the token
in the buffer, but, as null bytes are legitimate characters to
LEX, the true length of the token is returned by gettoken().

For example, the following function calls lexlength() to obtain
the length of a token.  It then calls the storage allocator to
allocate sufficient storage for the token and copies the token
into the allocated area.

        char *
        save()
        /*
         * Save current token, return a pointer to it
         */
        {
                register char *tbuffer;
                register int len;
                extern char *malloc();

                len = lexlength() + 1;
                if ((tbuffer = malloc(len)) == NULL)
                        error("No room for token");
                gettoken(tbuffer, len);
                return(tbuffer);
        }

4.0.3  Integ -- long integer, any base -

        long
        integ(nptr, base)
        char *nptr;

Integ() converts the Ascii string at 'nptr' into a long integer,
which it returns.  Conversion stops at the first non-digit,
where the digits are taken from the class "[0-9a-zA-Z]" as
limited by the given 'base'.  Integ() does not understand signs,
nor are blanks or tabs allowed in the string.
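The conversion rule just described can be sketched as below.
This is an illustration under the stated assumptions; my_integ
is a hypothetical name, not the library routine itself.

```c
#include <ctype.h>

/* Convert the ASCII string at 'nptr' to a long, stopping at the
 * first character that is not a digit in the given base; digits
 * beyond 9 are taken from the letters, case-insensitively. */
static long
my_integ(const char *nptr, int base)
{
        long value = 0;
        int digit;

        for (; *nptr != '\0'; nptr++) {
                if (isdigit((unsigned char)*nptr))
                        digit = *nptr - '0';
                else if (isalpha((unsigned char)*nptr))
                        digit = tolower((unsigned char)*nptr) - 'a' + 10;
                else
                        break;          /* not in [0-9a-zA-Z] */
                if (digit >= base)
                        break;          /* not a digit in this base */
                value = value * base + digit;
        }
        return (value);
}
```

So my_integ("ff", 16) yields 255, and a '9' in a base-8 string
simply ends the conversion, as described above.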

4.0.4  Lexchar -- steal character -

        lexchar()

Lexchar() returns the next character from the LEX input stream.
(This means that LEX will no longer see it.)  LEX uses a
look-ahead buffer to handle complex languages, and this function
takes this into account.

4.0.5  Lexecho -- write token to a file (STDIO ONLY) -

        lexecho(fp);
        FILE *fp;

Lexecho() may be called by a LEX action routine to write the
current token to a specified file.

                                  NOTE

        Programs using analyzers built with LEX's "-s" switch
        must supply their own lexecho() function if needed.

4.0.6  Lexgetc -- supply characters to yylex (STDIO ONLY) -

        lexgetc()

Lexgetc() is called by the lexical analyser to obtain characters
from its input stream.  The version in the library is dependent
on the standard I/O package, and is:

        FILE *lexin;            /* Declare iov address locally */
        lexgetc()
        {
                return(getc(lexin));
        }

If lexin is NULL when yylex() is entered, it will be assigned to
stdin.  This is done by yylex() calling the function llstin(),
which is generated in the table file.  Unless the "-s" switch is
given to LEX, the llstin() function assigns lexin to stdin if
lexin is NULL.  If the "-s" switch was given, the llstin()
routine is a no-op.  The user may provide his own version of
lexgetc() to pre-process the data to the lexical analyser.  An
example of this is shown in the appendix.

                                  NOTE

        Programs using analyzers built with LEX's "-s" switch
        must supply their own lexgetc() function, and "lexin"
        has no meaning in this context.

4.0.7  Lexlength -- return length of a token -

        lexlength();

Lexlength() may be called by a LEX action routine to obtain the
length of the current token in bytes.  An example of this is
shown in the description of gettoken().

4.0.8  Lexpeek -- examine character -

        lexpeek()

Lexpeek() performs a function similar to that of lexchar(), but
does not have the side-effect of removing the character from
LEX's view.

4.0.9  Lexswitch -- switch scanning tables -

        struct lextab *
        lexswitch(newtb)
        struct lextab *newtb;

Lexswitch() is called to cause LEX to use a different scanning
table; it returns a pointer to the one previously in use.  This
facility is useful if certain objects of the language (eg,
strings in C) have a fairly complicated structure of their own
which cannot be handled within the translation section of the
LEX description of the larger language.
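The pointer swap that lexswitch() performs can be sketched as
follows.  The contents of the stand-in lextab structure and the
names current and my_lexswitch are assumptions for illustration,
not the library's actual internals.

```c
/* A stand-in scanning-table structure; the real struct lextab
 * holds the state-machine tables. */
struct lextab {
        const char *name;
};

static struct lextab *current;  /* table the analyser would scan with */

/* Install a new table, returning the old one so that the caller
 * can restore it later -- this is how scanning environments may
 * be stacked. */
static struct lextab *
my_lexswitch(struct lextab *newtb)
{
        struct lextab *oldtb = current;

        current = newtb;
        return (oldtb);
}
```

An action that recognises the start of a string, for example,
would switch to the string table and restore the saved pointer
when the closing quote is seen.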

4.0.10  Llinit -- reinitialize yylex() -

        llinit()

Llinit() is a function which resets the state of yylex() to its
cold-start condition.  Several of yylex()'s variables are
initialized at compile time, and must be reinitialized if it is
to be serially re-used.  An example of this is where yylex() is
repeatedly called inside a loop which processes multiple input
files.  Each time a new file is started, llinit() must be called
before the first call to yylex() for the new file.

4.0.11  Mapch -- handle C escapes within strings (STDIO ONLY) -

        int mapch(delim, esc)
        char delim;
        char esc;

Mapch() is a function which handles C "escape" characters such
as "\n" and "\nnn".  It will scan off the entire escape sequence
and return the equivalent ASCII code as an integer.  It is meant
for use with YACC while scanning quoted strings and character
constants.

If it encounters EOF while scanning, it calls lexerror() to
print an error message warning of "Unterminated string".  If a
normal character is read, it returns the ASCII value.  If
"delim" (usually " or ') is read, it returns EOF.  If a newline
(ASCII linefeed) is read, it increments the global "yyline" and
calls itself recursively for the next line of input.  It may use
the ungetc() function to back up in the input stream.

                                  NOTE

        This routine is very application-specific for use by LEX
        and YACC when they are working together.  You should
        read the code in MAPCH.C before using this function.
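The flavour of the escape mapping can be sketched over an
in-memory string.  The reduced escape set and the name
map_escape are assumptions for illustration; the real mapch()
reads from the stdio stream and handles delimiters and newlines
as described above.

```c
/* Map one (possibly escaped) character from the string at *sp,
 * advancing *sp past what was consumed.  Handles \n, \t, and up
 * to three octal digits; anything else after '\\' maps to itself. */
static int
map_escape(const char **sp)
{
        int c = *(*sp)++;

        if (c != '\\')
                return (c);             /* ordinary character */
        c = *(*sp)++;
        switch (c) {
        case 'n':
                return ('\n');
        case 't':
                return ('\t');
        case '0': case '1': case '2': case '3':
        case '4': case '5': case '6': case '7': {
                int val = c - '0';
                int ndig = 0;           /* at most two more octal digits */

                while (ndig < 2 && **sp >= '0' && **sp <= '7') {
                        val = val * 8 + (*(*sp)++ - '0');
                        ndig++;
                }
                return (val);
        }
        default:
                return (c);             /* \" and the like map to themselves */
        }
}
```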

4.0.12  Token -- get pointer to token -

        char *
        token(end_pointer)
        char **end_pointer;

Token() locates the first byte of the current token and returns
its address.  It takes an argument which is either NULL or a
pointer to a character pointer; if the latter, that pointer is
set to point to the byte after the last byte of the current
token.  Token() is slightly faster, and more convenient than
gettoken(), for those cases where the token is only one or two
bytes long.

5.0  Error Detection and Recovery

If a character is detected in the input stream which cannot be
added to the last-matched string, and which cannot start a
string, then that character is considered illegal by LEX.  LEX
may be instructed to return a special 'error' token, or to write
a diagnostic with lexerror().  At present, the former is the
default action.

The token LEXERR is a special value which is recognised by Yacc,
and causes it to start its own error recovery.  It is defined by
the header file lex.h for use by other programs.

Often, it makes more sense to simply type a suitable diagnostic,
and continue by ignoring the offending character.  It is fairly
easy to cause LEX to do this, by including the auxiliary
definition:

        illegal = [\0-\377];

which defines a character class "illegal" which is handled
specially by LEX.  If the character that is causing the trouble
is a member of that character class (and in the example, all
characters are), then LEX will write a diagnostic, and ignore
it; otherwise, it will return the special token LEXERR.

More comprehensive techniques may be added as they become
apparent.

6.0  Ambiguity and Look-ahead

Many computer languages have ambiguous grammars in that an input
token may represent more than one logical entity.  This section
discusses the way in which grammars built by LEX resolve
ambiguous input, as well as a way for the grammar to assign
unique meaning to a token by looking ahead in the input stream.

6.1  Resolving Ambiguities

A LEX program may be ambiguous, in the sense that a particular
input string or strings might be matched by the regular
expression of more than one translation.  Consider,

        [a-z]   { putchar(*token(NULL)); }
        aaa*    { printf("abc"); }

in which the string 'aa' is matched by both regular expressions
(twice by the first, and once by the second).  Also, the string
'aaaaaa' may be matched in many different ways.  LEX has to
decide somehow which actions should be performed.
(Alternatively, it could produce a diagnostic, and give up.  As
it happens, LEX never does this.)

Consider a second example,

        letter = [a-z];
        %%
        A(letter)*      { return(1); }
        AB(letter)*     { return(2); }

which attempts to distinguish sequences of letters that begin
with 'a' from similar sequences that begin with 'ab'.  These two
examples illustrate two different kinds of ambiguity, and the
following indicates how LEX resolves these.

In the first example, it seems likely that the intent was to
have both 'aa' and 'aaaaaa' perform the second action, while all
single letters 'a' cause the first action to be performed.  LEX
does this by ensuring that the longest possible part of the
input stream will be used to determine the match.  Thus,

        <       { return(LESS); }
        <=      { return(LESSEQ); }

or

        digit(digit)*           { return(NUMBER); }
        letter(letter|digit)*   { return(NAME); }

would work as one might expect.
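The longest-match rule for the '<' / '<=' pair can be
demonstrated with a hand-written scanner fragment.  This is an
illustration of the rule, not LEX output; the function and token
names are hypothetical.

```c
#include <string.h>

enum { LESS = 1, LESSEQ = 2 };

/* Return the token for the operator at *s and advance past it,
 * always trying the longer match first -- the hand-coded
 * equivalent of LEX's longest-string rule. */
static int
scan_relop(const char **s)
{
        if (strncmp(*s, "<=", 2) == 0) {
                *s += 2;
                return (LESSEQ);
        }
        if (**s == '<') {
                *s += 1;
                return (LESS);
        }
        return (0);             /* no match here */
}
```

Given the input "<=", the fragment returns LESSEQ rather than
LESS followed by a stray '=', just as the LEX rules above would.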

In the second example, the longest-string rule does not help.
On the string "abb9", either action could apply, and so another
rule must be followed.  This states that if, after the
longest-string rule has been applied, there remains an
ambiguity, then the action which appears first in the LEX
program file is to be performed.  As the second example is
written, the second action will never be performed.  It should
have been written as:

        letter = [a-z];
        %%
        AB(letter)*     { return(1); }
        A(letter)*      { return(2); }

The two rules together completely determine a string.

At present, LEX produces no diagnostic in either case; it
merely applies the rules and proceeds.  In the case where
priority is given to the first-appearing rule, it might be a
good idea to produce a diagnostic.

6.2  Look-ahead

Some facility for looking ahead in the input stream is sometimes
required.  (This facility might also be regarded as a way for
the programmer to more closely control LEX's ambiguity
resolution process.)  For example, in C, a name followed by "("
is to be contextually declared as an external function if it is
otherwise undefined.

In Pascal, look-ahead is required to determine that

        123..1234

is an integer 123, followed by the subrange symbol "..",
followed by the integer 1234, and not simply two real numbers
run together.

In both of these cases, the desire is to look ahead in the input
stream far enough to be able to make a decision, but without
losing tokens in the process.

A special form of regular expression is used to indicate
look-ahead:

        re1 '/' re2 '{' action '}'

where 're1' and 're2' are regular expressions.  The slash is
treated as concatenation for the purposes of matching incoming
characters; thus both 're1' and 're2' must match adjacently for
the action to be performed.  'Re1' indicates that part of the
input string which is the token to be returned, while 're2'
indicates the context.  The characters matched by 're2' will be
re-read at the next call to yylex(), and broken into tokens.

Note that you cannot write:

        name = re1 / re2;

The look-ahead operator must be part of the rule.  It is not
valid in definitions.

In the first example, the look-ahead operator would be used as:

        name / "(" {
                if (name undefined)
                        declare name a global function;
        }
        name    { /* usual processing for identifiers */
        }

In the second example, the range construction would be parsed as
follows:

        digit   = [0-9];
        int     = digit(digit)*;
        %%
        int / ".." int  { /* Start of a range */
        }
        ".." int        { /* End of a range */
        }

Note that right-context is not sufficient to handle certain
types of ambiguity, as is found in several places in the Fortran
language.  For example,

        do i = 1        Is an assignment statement
        do i = 1, 4     Is a DO statement

It is not sufficient to use right-context scanning to look for
the comma, as it may occur within a parenthesized
sub-expression:

        do i = j(k,l)   Is an assignment statement

In Fortran, similar problems exist for IF and FORMAT statements,
as well as counted (Hollerith) string constants.  All of these
require a more powerful grammar than is possible with LEX
regular-expressions.

7.0  Multiple Scanning Tables; Processor Switching

Even a fairly simple syntax may be difficult, or impossible, to
describe and process with a single set of translations.  An
example of this may be found in C, where strings, which are part
of the language, have quite a different structure, and in order
to process them, either a function must be called which reads
and parses the input stream for itself, or some mechanism within
LEX must be invoked to cause a (usually massive) change of
state.

LEX does provide such a facility, which is known, after AED, as
'processor switching'.  Yylex() locates its tables through a
pointer; if one simply changes the pointer to point at a new
set of tables, one will have effected the required change of
state.  The LEX library function lexswitch(), which is described
elsewhere in this guide, arranges to do this; it also returns
the old value of the pointer so that it may be restored by a
later call to lexswitch().  Thus, scanning environments may be
stacked, or not, as the user requires.

7.1  Creation of a Processor

It should be clear that if all the tables produced by LEX from a
user's translation file have the same name, someone (the loader)
is bound to object.  Some method must be provided to change the
name of the table.

This is done by an option flag to the LEX command:

        -t name

will cause the scanning table to be declared as

        struct lextab name;

so that it may be passed to lexswitch():

        lexswitch(&name);

LEX also writes the program file to the file "name.c" rather
than to "lextab.c".

                                  NOTE

        If you use the "easy" command line ("-e name") when
        running LEX, the output file and table names will
        correspond nicely.  Re-read the section on operating LEX
        for more details.

8.0  Conclusion

LEX seems to handle most lexical analysis tasks easily.  Indeed,
LEX may be more generally used to write commands of a
text-processing nature; an example of such usage may be found
in an appendix.  LEX programs are far easier to write than the
equivalent C programs, and generally consume less space
(although there is an initial overhead for the more general
table-interpreter program).  The encoding suggested in [4]
achieves a reasonable compromise between table size and
scanning speed.  Certainly lexical analysers are less tedious
and time-consuming to write.

It is expected that most change in the future will be through
additions to the LEX library.  The LEX language may change
slightly to accommodate common kinds of processing (eg, break
characters), or to extend its range of application.  Neither
kind of change should affect existing LEX programs.

LEX produces tables and programs for the C language.  The tables
are in a very simple (and stylised) format, and when LEX copies
the action routines or the program section, the code might as
well be Fortran for all it cares.  One could write Unix filters
to translate the very simple C format tables into other
languages, allowing LEX to be used with a larger number of
languages, with little extra development cost.  This seems a
likely future addition.

Because of the look-ahead necessary to implement the "longest
string match" rule, LEX is unsuitable for interactive programs
whose overall structure is:

        for (;;) {
                prompt_user();
                get_input();
                process();
                print_output();
        }

If these are rewritten as LEX-generated grammars, the user will
be confused by the fact that the second input datum must be
entered before the first is processed.  It is possible to solve
this dilemma by rewriting function lexgetc() to return an
"end-of-line" character until processing is complete for that
line.  An example is shown in the Appendix.

9.0  Acknowledgements

LEX is based on a processor of the same name at Bell
Laboratories, which also runs under Unix [3], and, more
distantly, on AED-0 [1].  This version of LEX was based on the
description and suggestions of [4], although the implementation
differs significantly in a number of ways.

10.0  References

        1.  Johnson, W.L., et al., "Automatic generation of
            efficient lexical analysers using finite state
            techniques", CACM Vol. 11, No. 12, pp. 805-813, 1968.

        2.  Johnson, S.C., "Yacc -- Yet Another Compiler-Compiler",
            CSTR-32, Bell Telephone Laboratories, Murray Hill, New
            Jersey, 1974.

        3.  Lesk, M.E., "Lex - a lexical analyser generator",
            CSTR-39, Bell Telephone Laboratories, Murray Hill, New
            Jersey, 1975.

        4.  Aho, A.V., Ullman, J.D., Principles of Compiler Design,
            Addison-Wesley, Don Mills, Ontario, 1977.

                                APPENDIX A

                           LEX SOURCE GRAMMAR

The following is a grammar of LEX programs which generally
follows Backus-Naur conventions.  In the rules, "||" stands for
alternation (choose one or the other).  Other graphic text
stands for itself.  Several grammar elements have special
meaning:

        <anything>      Any text not including the following
                        grammar element (either a literal or
                        end-of-file).

        <nothing>       Nothing -- used for optional rule
                        elements.

        <name>          A variable name.

        <char_class>    A character class specifier.

        <string>        A string (text enclosed in '"').

        <EOF>           The end of the input file.

This grammar was abstracted from the Yacc grammar used to
describe LEX.

        program         ::= aux_section trans_section

        aux_section     ::= auxiliaries %%
                        ||  %%

        auxiliaries     ::= auxiliaries aux_def
                        ||  aux_def

        aux_def         ::= name_def = reg_exp ;
                        ||  %{ <anything> %}

        name_def        ::= <name>

        reg_exp         ::= <char_class>
                        ||  <string>
                        ||  <name>
                        ||  reg_exp *
                        ||  reg_exp | reg_exp
                        ||  reg_exp reg_exp
                        ||  ( reg_exp )

        trans_section   ::= translations
                        ||  <nothing>

        translations    ::= translations translation
                        ||  translation

        translation     ::= pattern action
                        ||  %{ <anything> %}
                        ||  %% <anything> <EOF>

        pattern         ::= reg_exp / reg_exp
                        ||  reg_exp

                                APPENDIX B

                           SOME SMALL EXAMPLES

The following example illustrates the use of the look-ahead
operator, and various other of the nuances of using LEX.

B.1  A Complete Command

The C programming language has had two different ways of writing
its assignment operators.  The original method was to write a
binary operator immediately following the ordinary assignment
operator, forming a compound operator.  Thus 'a =+ b' caused the
value of 'a+b' to be assigned to 'a'.  Similarly,

        =-  =/  =%  =*  =<<  =>>  =|  =&  =^

were written for the assignment operators corresponding to
subtraction, division, modulus, multiplication, left shift,
right shift, bitwise OR, bitwise AND, and exclusive OR.  In the
current version of the language, the binary operator is written
to the left of the assignment operator, to remove potential
ambiguity.

The LEX program "ctoc" is a filter which converts programs
written in the older style into programs written in the newer
style.  It uses the look-ahead operator, and the various
dis-ambiguating rules, to ensure that sequences like

        a==-1   a=++b

remain unchanged.

/*
 * ctoc.lxi -- Convert old C operators to new C form
 *
 * Adapted from example in C. Forsyth's LEX manual.
 *
 * NOTE:
 * Forsyth's program put an entire comment into the token
 * buffer.  Either define a huge token buffer for my typical
 * monster comments, or filter text within comments as if
 * it were real C code.  This is what I did.  So =+ inside
 * a comment will get changed to +=, etc.  Note that you
 * may use the comment() function in LEXLIB if you want the
 * comments eaten.  I wanted 'em in the output.
 * by
 * Bob Denny
 * 31-Feb-81
 */
%{
char tbuf[80];          /* Token buffer */

main()
{
        while (yylex())
                ;
}
%}
any       = [\0-\177];
nesc      = [^\\];
nescquote = [^\\"];
nescapost = [^\\'];
schar     = "\\" any | nescquote;
cchar     = "\\" any | nescapost;
string    = '"' schar* '"';
charcon   = "'" cchar* "'";
%%
"=" ( << | >> | "*" | + | - | "/" | "%" | "&" | "|" | "^" )
        {
        gettoken(tbuf, sizeof tbuf);
        printf("%s=", tbuf+1);
        }
/*
 * The following will overflow the token buffer on any but a
 * small comment:
 */
/*********
"/*" ([^*] | "*"[^/])* "*/"
        {
        lexecho(stdout);
        }
**********/
[<=>!]"=" | "="[<>]
        {
        lexecho(stdout);
        }
"=" / ( ++ | -- )
        {
        lexecho(stdout);
        }
charcon
        {
        lexecho(stdout);
        }
string
        {
        lexecho(stdout);
        }
[\0-\377]
        {
        lexecho(stdout);
        }

Assuming the Decus compiler running on RSTS/E in RT11 mode, the
above program would be built and executed as follows:

        mcr lex -i ctoc.lxi -o ctoc.c
        cc ctoc/v
        as ctoc/d
        link ctoc=ctoc,c:lexlib,c:suport,c:clib/b:2000

        mcr ctoc <old.c >new.c

B.2  Interactive Lexical Analysis

The following program reads words from the terminal, counting
each as they are entered.  The interaction with the operator is
"natural" in the sense that processing for one line is complete
before the next line is input.  To implement this program, it
was necessary to include a special version of lexgetc() which
returns <NULL> if the current line has been completely
transmitted to the parser.  Because the parser must still have
some look-ahead context, it will return the "end-of-line" token
twice at the beginning of processing.  This required some
additional tests in the main program.

/*
 * Count words -- interactively
 */
white   = [\n\t ];      /* End of a word        */
eol     = [\0];         /* End of input line    */
any     = [!-~];        /* All printing char's  */
illegal = [\0-\377];    /* Skip over junk       */
%{
char    line[133];
char    *linep  = line;
int     is_eof  = 0;
int     wordct  = 0;
#define T_EOL   1

main()
{
        register int i;

        while ((i = yylex()) != 0) {
                /*
                 * If the "end-of-line" token is
                 * returned AND we're really at
                 * the end of a line, read the
                 * next line.  Note that T_EOL is
                 * returned twice when the program
                 * starts because of the nature of
                 * the look-ahead algorithms.
                 */
                if (i == T_EOL && !is_eof
                 && *linep == 0) {
                        printf("* ");
                        fflush(stdout);
                        getline();
                }
        }
        printf("%d words\n", wordct);
}
%}
%%
any(any)*       {
                /*
                 * Write each word on a
                 * separate output line.
                 */
                lexecho(stdout);
                printf("\n");
                wordct++;
                return(LEXSKIP);
                }
eol             {
                return(T_EOL);
                }
white(white)*   {
                return(LEXSKIP);
                }
%%
getline()
/*
 * Read a line for lexgetc()
 */
{
        is_eof = (fgets(line, sizeof line, stdin)
                        == NULL);
        linep = line;
}

lexgetc()
/*
 * Homemade lexgetc -- return zero while at the
 * end of an input line or EOF at end of file.  If
 * more on this line, return the next byte.
 */
{
        return( (is_eof) ? EOF
                : (*linep == 0) ? 0
                : *linep++);
}