Random  C  Documentation

                    Copyright (c) 1976, 1977, 1978 by Alan Snyder


          File:  ccdoc.r
          Last Modified:  4 April 1978


          1.  Compiler Structure  (28 Feb 1977)

          The C compiler consists of six logical phases:

                  L - lexical analysis phase (C1)
                  P - parsing phase (C2)
                  C - code generation phase (C3)
                  M - macro expansion phase (C4)
                  E - error message editing phase (C5)
                  S - optional symbol table dumping phase (C8)

          In addition, there may be a control program (CC) that invokes the
          above  phases.    Each of the above phases may be run as separate
          jobs.  However,  a  pair  of  configuration  options  allows  the
          following phases to be merged into single jobs:

                  L and P (called LP)
                  E and CC (called CC)

          The source files which make up the various phases are as follows:

                  L: c1 c92 c95 c96
                  P: c21 c22 c23 c24 c25 c26-xx c91 c95 c96
                  C: c31 c32 c33 c34 c35-xx c91 c94 c95 c96
                  M: c41 c42-xx c43-xx c92 c93 c94 c95 c96
                  E: c5 c93 c96
                  S: c8 c93 c95

          Files with suffixes -xx are target machine dependent.  Files c25,
          c35,  and  c43  are  constructed  from the output of GT.  (A TECO
          macro, install.teco, is provided which does this;  if  you  don't
          know  TECO,  simply use the provided c25, c35, and c43 files as a
          guide.) File c42 contains the C routine macros.
          C Documentation               - 2 -                  4 April 1978


          2.  Intermediate Files  (28 Feb 1977)

          The phases of the compiler communicate using intermediate  files.
          These files may be of several formats:

                  TEXT - consists of lines separated by NEW-LINE
                  CHARACTER   BINARY   -   an  array  of  arbitrary
                       characters, with no  special  interpretation
                       for any characters
                  INTEGER BINARY - an array of arbitrary integers

          Three operations are performed on intermediate files:

                  READ
                  WRITE - causes the file to be created
                  APPEND - prior existence of file assumed

          The following routines are used to peform these operations:

                  TEXT: COPEN, CGETC, CPUTC, CCLOSE
                  CHARACTER BINARY: COPEN, CGETC, CPUTC, CCLOSE
                  INTEGER BINARY: COPEN, GETI, PUTI, CCLOSE

          The  following  table  describes  the files used by the compiler,
          their formats,  and  which  operations  are  performed  by  which
          phases.

                  source    the  source file, provided by the user;
                            TEXT, READ: L
                  output    the  output   file,   whose   name   is
                            determined  from  the source file name;
                            TEXT, WRITE: M
                  TOKEN     the  internal  representation  of   the
                            program in the form of tokens (not used
                            if  LP  merged); INTEGER BINARY, WRITE:
                            L, READ: P
                  CSTORE    holds the source  form  of  identifiers
                            and floating-point constants; CHARACTER
                            BINARY, WRITE: L, READ: M, E, S
                  STRING    holds    character   string   literals;
                            CHARACTER BINARY, WRITE: L, READ: M
                  ERROR     error  messages   in   internal   form;
                            INTEGER  BINARY,  WRITE: CC, APPEND: L,
                            P, C, M, READ: E
                  NODE      internal representation of  program  as
                            syntax trees; INTEGER BINARY, WRITE: P,
                            READ: C
                  TYPTAB    type  table;  INTEGER BINARY, WRITE: P,
                            READ: C, S
                  HMAC      header macros; TEXT, WRITE: P, READ: M
                  MAC       macros;  TEXT,  WRITE:  P,  APPEND:  C,
                            READ: M
          C Documentation               - 3 -                  4 April 1978


                  SYMTAB    symbol  table, written only if a symbol
                            table  listing  is  requested;  INTEGER
                            BINARY, WRITE: P, READ: S

          3.  Library Routines Assumed by the C Compiler

          The following routines are assumed to exist in the C library when
          the C compiler is loaded:

               routine                    description

               fd = copen (s, m, f)       open file for input or output
               c = cgetc (fd)             read character from file
               cputc (c, fd)              output character to file
               b = ceof (fd)              test for end-of-file
               cclose (fd)                close file
               puti (i, fd)               output integer in internal format
               i = geti (fd)              read integer in internal format
               cexit (cc)                 terminate phase with completion code
               p = getvec (n)             allocate block of storage
               b = cisfd (x)              is X a valid file-descriptor (used by CPRINT)

          The  I/O  routines must support read, write, and append mode, and
          TEXT, CHARACTER BINARY, and  INTEGER  BINARY  formats.   For  the
          BINARY  formats,  no formatting transformations may be performed;
          the bits read  must  exactly  equal  the  bits  written  for  all
          possible  values.   The  storage  allocator may be quite trivial;
          note that the storage so allocated is never freed.

          The following external variables are referenced:

               name                       description

               int cout;                  standard output unit
          C Documentation               - 4 -                  4 April 1978


          4.  Tokens

          A token consists of two integer values, a TAG and an INDEX.   The
          token tags are as follows:

                             0           20           40           60

                0                         /          int        entry
                1          eof            *         char     register
                2                         -        float       sizeof
                3            ;            +       double         long
                4            ~            =       struct        short
                5            {            <         auto     unsigned
                6            ]            >       static      typedef
                7            [           ++       extern
                8            )           --       return
                9            (           ==         goto
               10            :           !=           if
               11            ,           <=         else
               12            .           >=       switch
               13            ?           <<        break
               14            ~           >>     continue
               15            !           ->        while          idn
               16            &       asgnop           do       intcon
               17            |           &&          for     floatcon
               18            ^           ||      default       string
               19            %                      case       lineno

          The following token tags are used internal to the lexical phase:

                  LEXEOF      end of compiler control line
                  TCONTROL    start of compiler control line
                  TMARG       <macro argument>

          The interpretation of the INDEX is dependent upon the TAG:

                  TIDN      index of identifier name in CSTORE
                  TINTCON   value of integer constant
                  TFLOATC   index of source representation of float
                            constant in CSTORE
                  TSTRING   index   of   source  representation  of
                            string constant in string file ('\0' is
                            represented as "$0", '$' as "$$")
                  TLINENO   the line number
                  TMARG     the argument number (first one is 0)
                  TEQOP     an index repesenting which =op it is:
          C Documentation               - 5 -                  4 April 1978


                             0 =>>
                             1 =<<
                             2 =+
                             3 =-
                             4 =*
                             5 =/
                             6 =%
                             7 =&
                             8 =^
                             9 =|

                  others    the number of the line upon  which  the
                            token appeared

          5.  Token Intermediate File

          The representation of a token in the token intermediate file is a
          TAG optionally followed by an INDEX.  The various kinds of tokens
          are described in the table below:

               token     tag      index

               eof       1        (none)
               operator  tag      (none)
               keyword   tag      (none)
               asgnop    36       index
               idn       75       index
               intcon    76       value
               floatcon  77       index
               string    78       index
               lineno    79       line number
          C Documentation               - 6 -                  4 April 1978


          6.  Syntax Tree Node Types

          The node types in syntax trees are as follows:

                             0           20           40           80

                0                         %           =%           if
                1          eof            /           =&         goto
                2          idn         *(2)           =^       branch
                3          int         -(2)           =|        label
                4        float            +           &&     stmtlist
                5       string            =           ||       switch
                6         call           ==            .         case
                7            ?           !=            :      default
                8      ++(pre)            <            ,       return
                9     ++(post)            >       sizeof      program
               10      --(pre)           <=                  exprstmt
               11     --(post)           >=                     elist
               12         *(1)           <<
               13         &(1)           >>
               14         -(1)          =>>
               15            ~          =<<
               16            !           =+
               17         &(2)           =-
               18            |           =*
               19            ^           =/

          7.  Syntax Tree Nodes

          A  syntax  tree  node  consists of the node type followed by some
          number of integers and/or pointers to other  syntax  tree  nodes.
          The  node  types  can  be  grouped  into classes according to the
          locations of  pointers  in  the  nodes.    The  following  tables
          describe  the  formats  of  the nodes in each of the various node
          type classes.  The following terms are used:

               <iln>     internal label number
               <e>       pointer to expression tree
               <s>       pointer to statement tree
               <chain>   part of chain of case labels
               <elist>   pointer to expression list node

          Now for the descriptions:

               class 0: no pointers

                    FIELDNAME <index of name in cstore>
                    IDENTIFIER <type> <class> <offset>
                    INTEGER <value>
                    FLOAT <index of source rep in cstore>
                    STRING <index of value in string file>
                    BRANCH <iln>
          C Documentation               - 7 -                  4 April 1978


               class 1: pointer at 2

                    RETURN <lineno> <e>
                    GOTO <lineno> <e>
                    LABEL <iln> <s>
                    EXPRSTMT <lineno> <e>

               class 2: pointers at 2, 3, 4

                    IF <lineno> <e> <s1> <s2>
                    SWITCH <lineno> <e> <s> <chain>

               class 3: pointers at 1, 2

                    STMTLIST <s1> <s2>
                    EXPRLIST <elist> <e>
                    <BINARY-OP> <e1> <e2>
                    FUNCTION-CALL <e> <elist>

               class 4: pointer at 1

                    <UNARY-OP> <e>
                    CASE <chain> <iln> <integer>
                    DEFAULT <chain> <iln>

               class 5: pointer at 3

                    PROGRAM <function-idn> <function-type> <s> <stack-size> <nargs>

          8.  Looping

          The parser converts the three looping constructs (DO, WHILE, FOR)
          into statement lists with labels  and  gotos.    Labels  must  be
          present  for the actual loop, for the continue statement, and for
          the break statement.  These transformations are described below:

               do <stmt> while (<e>); ->          {l3:
                                                  <stmt>
                                                  l2: if (<e>) goto l3;
                                                  l1:;~

                    where continue = goto l2 and break = goto l1.

               while (<e>) <stmt> ->              {l2:
                                                  if (e) {<stmt> goto l2;~
                                                  l1:;~

                    where continue = goto l2 and break = goto l1.
          C Documentation               - 8 -                  4 April 1978


               for (<e1>;<e2>;<e3>) <stmt> ->     {<e1>;
                                                  l3: if (<e2>)
                                                     {<stmt>
                                                     l2: <e3>;
                                                     goto l3;
                                                     ~
                                                  l1:;~

                    where continue = goto l2 and break = goto l1.

          9.  Symbol Table Format  (28 Feb 1977)

          This section describes the format of the symbol table.  There are
          two separate dictionaries, the global dictionary  and  the  local
          dictionary.    The global dictionary is used to hold declarations
          that are made at top-level (not within a function definition) and
          all extern declarations.  The local dictionary is  used  to  hold
          all non-extern declarations made within a function definition.

          The  local  definition  is used only while processing a function;
          when  the  processing  of  the  function  terminates,  the  local
          dictionary is emptied.  The local dictionary grows and shrinks as
          blocks  are  entered  and  exited.    When a block is exited, the
          declarations local  to  that  block  are  checked  for  undefined
          structure  types  and  unused  automatic  variables.  Entries for
          undefined  labels  are  percolated  to  the  next  higher  level;
          undefined labels are checked at the end of the function.

          In order to minimize the number of ways that the compiler can run
          out of space, both dictionaries are allocated in one array
          of dictionary entries called DICT, as follows:

                                   _________________________________
                  DBEGIN      --> |                                 |
                                  |  GLOBAL DEFINITIONS             |
                                  |_________________________________|
                  DGDP        --> |                                 |
                                  |  FREE                           |
                                  |                                 |
                                  |_________________________________|
                  DLDP        --> |                                 |
                                  |  LOCAL DEFINITIONS              |
                                  |_________________________________|
                  DEND        -->
          C Documentation               - 9 -                  4 April 1978


          10.  Types  (28 Feb 1977)

          All  C  types  are  represented  by  pointers  into  a type table
          (TYPTAB).  The format of the entry consists of a TAG, a SIZE,  an
          ALIGNMENT  class,  plus  some  other  values whose interpretation
          depends on the TAG.  There are TAGs for  the  basic  types  (e.g.
          INT),  plus  the  type modifiers (e.g. POINTER).  The extra value
          for most type modifiers  is  another  TYPE  (i.e.  a  pointer  to
          somewhere  else  in  the TYPTAB).  Types not involving structures
          are uniquely represented, so that type equality can  be  done  by
          comparing  the  pointers.    Type  equality  involving  recursive
          structures is not worth the effort, and not really needed anyway.
          An entry for a structure type contains a pointer  to  a  list  of
          field definitions, allocated from the bottom of TYPTAB.

          Writing  the  type  table  onto  an  intermediate  file  involves
          converting the pointers to offsets.

          11.  Machine Description Updates  (26 Apr 1976)

          The following lists changes to the machine description language.

          1.   There  is  a  new,   optional   definition   statement,
               OFFSETRANGE,  which  specifies  allowable  offsets  (in
               addressable units)  in  indirect  references.    Offset
               ranges  may  be  specified for each pointer class.  The
               default is to allow only zero offsets.  The form of the
               statement is

                offsetrange px (lo, hi), ... ;

               The lo and hi are integers which specify the lowest and
               highest allowable offsets for  indirect  references  of
               class  px.  Either or both of lo and hi may be omitted,
               meaning that there is no bound in that direction.   The
               OFFSETRANGE  statement  should  immediately  follow the
               POINTER statement.

          12.  AMOP Updates  (2 Apr 1977)

          The following lists changes to the set of AMOPs.

          1.   There are two new AMOPs, LSEQ (0010) and COMMA  (0170).
               Both  take  two arbitrary expressions and evaluate them
               in order.    LSEQ  returns  its  first  operand,  COMMA
               returns its second operand.

          2.   The increment and decrement operators are now optional.

          3.   New,  optional  AMOPS  have been added for ++ and -- to
               floats and doubles.
          C Documentation              - 10 -                  4 April 1978


          4.   New, optional AMOPS have  been  added  to  specify  the
               passing  of function arguments (in particular, to allow
               pushing function arguments on a stack).  The AMOPS  are
               .ARGI, .ARGD, .ARG0, .ARG1, .ARG2, .ARG3, for integers,
               doubles  (if passed directly), and pointers.  Each AMOP
               takes one operand, the actual argument, and  should  be
               specified to leave its result in the same location.  If
               these  AMOPs  are defined, then the code generator does
               not allocate any stack space for argument  lists.    It
               simply invokes one of these operators for each argument
               (from  left to right) and assumes that they do whatever
               is necessary.  When these  AMOPs  are  used,  the  ARGP
               argument to the CALL macro is meaningless.

          13.  Keyword Macro Updates  (3 Apr 1977)

          The following lists changes to the set of keyword macros.

          1.   DATA:   %DA()

          The  DATA macro indicates that the following macros define impure
          data areas which are not explicitly initialized.

          2.   IMPURE:   %IM()

          The IMPURE macro indicates that following  macros  define  impure
          data  areas  which  are  not explicitly initialized (and thus are
          implicitly initialized to zero).

          3.   PDATA:   %PD()

          The PDATA macro indicates that following macros define pure  data
          areas, namely string literals.

          4.   PURE:   %PU()

          The PURE macro indicates that following macros define pure code.

          5.   ALIGNn:   %ALn()

          Instead  of  a  single  ALIGN  macro,  there  are now three ALIGN
          macros, for alignment classes 1, 2, and 3.  Alignment class 0  is
          the alignment for characters.

          6.   ADCONn:   %An(0,BASE,OFFSET)

          The  arguments to ADCON are now a base and offset, which together
          make up a REF that describes  either  an  external  or  a  static
          variable or function.
          C Documentation              - 11 -                  4 April 1978


          7.   PROLOG:   %P(FUNCNO,FUNCNAME,FNARGS)

          The  FNARGS  argument  has been added: it specifies the number of
          formal arguments to the function.

          8.   RETURN:   %RT(FUNCNO)

          The FUNCNO argument has been added  to  allow  reference  to  the
          corresponding EPILOG code.

          9.   STATIC:   %ST(N)

          The  size  argument has been flushed, as it could not be reliably
          computed, anyway.

          10.   ELSWITCH:   %ELS(N,LBASE,LOFFSET,IBASE,IOFFSET)

          This new macro terminates the lists following LSWITCH macros.

          11.   ETSWITCH:   %ETS(LO,LBASE,LOFFSET,IBASE,IOFFSET,HI)

          This new macro terminates the lists following TSWITCH macros.

          14.  Pointer Comparison  (15 Dec 1974)

          The following comparison operations are allowed on pointers:

                  1. A pointer may be compared  with  any  pointer,
                     using any of the comparison operations.
                  2. A  pointer  may  be  tested  for  equality  or
                     inequality with 0.

          In the first case, we have the operation p1:p2 -> l, where p1 and
          p2 are the  pointers  and  l  is  the  destination  label.    The
          algorithm is as follows:

                  if (ctype (p1) < ctype (p2)) exchange (p1, p2);
                  if (ctype (p1) > ctype (p2))
                          convert p1 to ctype (p2);
                  apply operator of ctype (p2);

          In  the  second  case, we have the operation p:0 -> l, where p is
          the pointer and l is the destination label.  The  default  method
          of  handling  such  an  operation is to convert the pointer to an
          integer and apply the integer comparison operation.  However,  as
          an  optimization, we allow the definition of special null-pointer
          testing operations, ==0px and !=0px, for any pointer class x.

          Note that the pointer-to-integer conversions are not supposed  to
          really  change  anything;  their  main  purpose  is to accomplish
          moving the pointer value from a pointer register  to  an  integer
          register  (where  necessary).  In the case of comparing a pointer
          to zero, this move may not be necessary as  machine  instructions
          C Documentation              - 12 -                  4 April 1978


          to  do  the  test on the pointer register may exist; that is what
          the optional pointer comparison operators are for.

          15.  Changes to GT  (28 Dec 1974)

          Two new tables are output by GT for use by  the  code  generation
          phase.  These tables are OPREG[] and OPMEM[].  The tables contain
          one  word  for each AMOP; the bits of the word indicate registers
          in OPREG and memory reference classes in OPMEM.  The bits are set
          to indicate which locations are possible result locations for the
          AMOPs.

          The following entries are pre-defined (G indicates set by  GT,  R
          indicates   set  by  the  code  generator  as  part  of  run-time
          initialization):

               AMOP           value                              set

               *u             any indirect (of appropriate type) R
               !, &&, ||      all integer registers              G
               comparison     all integer registers              G
               =, ?           any register (of appropriate type) R
               call           retreg of appropriate type         R
               e_int          intlit                             G
               e_float        floatlit                           G
               e_string       stringlit                          G
               e_idn          appropriate idn class              R

          16.  Scoring Algorithm  (28 Dec 1974)

          This  section  describes  the  scoring  algorithm  used  in   the
          selection  of  OPLOCs.    It  is based on a cost of 4 units for a
          register-register move and 5 units for  a  register-memory  move.
          An  impossible  situation is scored at -1000.  The total score is
          the sum  of  all  applicable  scores  derived  by  the  following
          algorithm:

          For each clobbered register which is busy, -10 (2 RM).

          Considering  the  possible  result  locations  of  the  top-level
          operator:

             If the desired result location is a label but the top-level
             operator never has a label result,  -1000.   Otherwise,  if
             the desired result location is register, then:

                If some desired register is possible and at least one
                possible  desired  register  is  free  or  all of the
                desired registers are busy then 0 (there will  be  no
                penalty  for  picking this OPLOC, on the basis of the
                result location, at least).   Otherwise,  if  a  busy
                desired  register  is  possible  and  not all desired
                registers are busy, then -10 (2 RM).  Otherwise, if a
          C Documentation              - 13 -                  4 April 1978


                non-desired free register is  possible,  -4  (1  RR).
                Otherwise,   if   a   non-desired  busy  register  is
                possible, -14 (1 RR + 2 RM).  Otherwise, if memory is
                a possible result location, -5 (1 RM).

             Otherwise, if the desired result location is memory, then:

                If all of the possible result locations are  desired,
                then 0.  Otherwise, if a temporary is desirable and a
                free  register is a possible result location, then -5
                (1 RM).  Otherwise, if a temporary is desirable and a
                busy register is a possible result location, then -15
                (3 RM).  Otherwise, -1000.

          Considering the possible locations of each of the operands of the
          top-level operator:

             If the operand is desired in a  register  and  the  operand
             location   is   identically  the  result  location  of  the
             top-level operator, then if no desired registers are  free,
             then  -10  (2 RM).  If the operand is desired in a register
             and the operand's top-level operator may return its  result
             in  a  register,  then  if there is no register which meets
             both requirements, -4 (1 RR).  If the operand is desired in
             a register  and  the  operand's  top-level  operator  never
             returns its result in a register, then -5 (1 RM).

             If  the  operand  is desired in memory and all non-indirect
             memory  reference  classes  are  acceptable,  then  if  the
             operand's  top-level  operator  never returns its result in
             memory, -5 (1 RM).  If the operand is desired in memory and
             there are one  or  more  unacceptable  non-indirect  memory
             reference  classes,  then  if  all  possible  memory result
             locations of  the  operand's  top-level  operator  are  not
             acceptable  or  if  the  operand's  top-level  operator may
             return its result in a register and a temporary location is
             not acceptable, -1000.

          17.  Changes to GT  (31 Dec 1974)

          The distinction  in  the  interpretation  of  the  #  indicators,
          depending upon whether or not they appear in a macro call or not,
          no  longer  exists.  The #X indicators now always refer to a call
          on the NAME macro.   New  #'X  indicators  are  recognized  which
          always refer to the macro argument sequences.