mirror of
https://github.com/IanDarwin/OpenLookCDROM.git
synced 2026-05-05 15:44:08 +00:00
504 lines
23 KiB
Plaintext
504 lines
23 KiB
Plaintext
Article 37298 of comp.lang.c:
|
|
Xref: sq comp.lang.c:37298 comp.unix.questions:29345 comp.unix.wizards:1734 comp.lang.misc:7537
|
|
Path: sq!geac!lethe!torsqnt!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!wuarchive!uunet!mcsun!unido!mikros!mwtech!martin
|
|
From: martin@mwtech.UUCP (Martin Weitzel)
|
|
Newsgroups: comp.lang.c,comp.unix.questions,comp.unix.wizards,comp.lang.misc
|
|
Subject: Re: FAQ for lex + yacc? (was Re: using lex with strings, not files)
|
|
Keywords: lex, yacc, strings
|
|
Message-ID: <1139@mwtech.UUCP>
|
|
Date: 12 May 91 20:11:48 GMT
|
|
References: <5384@lectroid.sw.stratus.com> <1137@mwtech.UUCP>
|
|
Reply-To: martin@mwtech.UUCP (Martin Weitzel)
|
|
Organization: MIKROS Systemware, Darmstadt/W-Germany
|
|
Lines: 487
|
|
|
|
Here I follow up to my own article. Thanks to all who responded by mail,
|
|
especially to Steve Summit who wrote a very long answer and made valuable
|
|
proposuals.
|
|
|
|
For now I've decided to post the stuff once more since there seems to
|
|
be sufficient request for it. I even included one more new Q+A on the
|
|
request of one person who mailed me. (BTW: I'm still open to include
|
|
more Q+As in future releases.)
|
|
|
|
There seems to be some consensus NOT to make a REGULAR posting of it.
|
|
Instead I request all who archive FAQ-like stuff to do so with this
|
|
article. If you do and your archive has an easy way to be accessed from
|
|
the public (ftp, mail-server), let me know the procedure how to do so,
|
|
so that I can point future requests for the answers to this archive
|
|
and must neither mail all this to individuals, nor use up bandwitdh in
|
|
the news with it.
|
|
|
|
BTW: Sometime ago someone came up with the idea of a - probably moderated -
|
|
news group to cross-post FAQ-like stuff so that it could be archived
|
|
automatically or at least given a long expiration period. I'm not sure how
|
|
this proposual advanced (and I'm even not sure if such a group is a good
|
|
idea in general) but for the stuff I have it sounds all right.
|
|
|
|
============ Now, for all who are waiting, here we go again ==============
|
|
|
|
|
|
1.) How does one redefine the I/O in a yacc/lex piece of code? I.e.
|
|
the code which is generated defaults to stdin and stdout for input
|
|
and output, respectively.
|
|
|
|
The calling tree in a lex+yacc application, when it comes to read
|
|
input and you do not change anything, normally is:
|
|
|
|
(main or whatever)
|
|
---> yyparse
|
|
---> yylex
|
|
---> input[Macro]
|
|
---> getc(yyin) [yyin defaults to stdin]
|
|
|
|
If you want to read from some other source instead of stdin, you have
|
|
several points where you can change the behavior. (In the very simplest
|
|
case you could even change nothing and use the input redirection of UNIX.)
|
|
|
|
If your program generated by lex has to read from another source as stdin,
|
|
just fopen() the file and assign the returned FILE-pointer to yyin. The
|
|
latter is a global symbol in the object file that results from compiling
|
|
what lex generated. If you prefer separate compilation you must define
|
|
it as
|
|
extern FILE *yyin;
|
|
|
|
in the module where you assign to it. The right place for this will
|
|
probably be the function which calls yyparse in the above example. Note
|
|
that the main program from the yacc-library is NOT linked if you supply
|
|
your own. So you can play any games you like before (and also after)
|
|
calling yyparse.
|
|
|
|
Next step of complexity is to change the input-macro of yylex. This
|
|
is useful sometimes, but you should be reluctant to do so until you have
|
|
gathered a bit experience with lex and understand the implications. Here
|
|
are some details how the individual routines and functions call each other
|
|
(note that the situation described is reverse engineered from several
|
|
implementations under XENIX and SysV UNIX - it *may* differ with other
|
|
implementations, most notably "flex"):
|
|
|
|
main (from lex-library or own)
|
|
|
|
|
V
|
|
yylex --------------------------+-+
|
|
+---+ | | +-----+ | |
|
|
| V V V V | ...... V V .........
|
|
| input unput | : yyless yyreject : in the
|
|
| | :.... | ... | . | ..: lex-library
|
|
| | V V V
|
|
| | yyunput yyinput
|
|
| | | |
|
|
| +--------------------+ |
|
|
+------------------------------------------------+
|
|
|
|
What you should note here first is:
|
|
|
|
- When the next character is needed in yylex, input (NOT yyinput!!)
|
|
is called. Normally, input is #defined as macro but you can
|
|
re-#define it, or #undef it and make a function with this name
|
|
visible when you compile and link lex.yy.c.
|
|
|
|
- There is another macro, unput, that must properly undo the actions
|
|
of input, though unput is only called if your regular expressions
|
|
require some look-ahead. (If you are not *very* experienced with
|
|
regular expressions, assume that there will *always* will be
|
|
look-ahead.)
|
|
|
|
So if you want to change things at this level, you must find the right
|
|
place for your redefinitions, that is, you must write them somewhere
|
|
into the ".l"-file (with the lex-source), so that they appear *after*
|
|
the #define-s that are automatically generated by lex, but *before* the
|
|
first use of input/unput. As the order in which the parts of your ".l"-file
|
|
appear in lex.yy.c seem to have changed with the evolution of lex, you
|
|
should check for the right place if you try this the first time!
|
|
|
|
(It appears as if the "safest" (i.e. most portable) place for this is
|
|
right at the beginning of the second part of the ".l"-file, immediately
|
|
before the first regular expression.)
|
|
|
|
file.l
|
|
------------------------------------------------------------
|
|
first part
|
|
%%
|
|
%{
|
|
#undef input /* ANSI-C requires that, though */
|
|
#undef unput /* older compilers may do without */
|
|
#define input .... whatever ....
|
|
#define unput ... as you like ..
|
|
%}
|
|
first-regex {
|
|
... action ...
|
|
}
|
|
second-regex {
|
|
.... action ...
|
|
}
|
|
... etc. ......
|
|
%%
|
|
third part
|
|
------------------------------------------------------------
|
|
|
|
Now for the tricky part: As you see from the picture above which shows how
|
|
the functions call each other, there are some routines in the lex-library
|
|
which need sometimes to input or unput characters. These routines *must*
|
|
use exactly the redefined versions of input and unput. But: How can these
|
|
library routines link to something that is defined as macro?
|
|
|
|
The solution can again be seen if you carefully study lex.yy.c, the
|
|
source generated by lex. At the end of lex.yy.c you find the functions
|
|
yyinput and yyunput (note the yy-prefix now!), which do no more and no
|
|
less than calling input and unput. As the two functions are COMPILED
|
|
in the context of lex.yy.c, where the (possibly changed) macro-definitions
|
|
are visible, they build the "stubs" through which the functions in the
|
|
lex-library access the macros. (Quite easy after you understand the
|
|
trick but not at all obvious for the newcomer.)
|
|
|
|
** AGAIN: Look at the above scheme showing the calling hierarchy and
|
|
** try to understand the dependencies. Eventually study lex.yy.c a bit
|
|
** further. THEN you might consider writing your own input/unput macros!
|
|
|
|
Finally, you can consider avoiding lex at all and roll your own yylex.
|
|
If you have only the "ancient" lex which is supplied with most unix systems
|
|
(contrary to the rewrite "flex"), it could eventually be an advantage to
|
|
do so, since lex-generated programs are known to be not so much efficient
|
|
compared to hand-written scanners. (It's not easy to decide whether the
|
|
performance loss in lex generated programs is a legend or reality. Most
|
|
comparisons are based on trivial scanners which are easily written by hand.
|
|
In any case it seems recommendable to use lex during the early development
|
|
steps as prototyping tool and eventually change to hand-written scanners
|
|
when things are becoming more stable.)
|
|
|
|
For the redirection of output there is no problem at all, since this is
|
|
fully under control of the C-program fragments you write in your actions
|
|
of the lex+yacc source.
|
|
2.) How can I get the automagically-defined #defines, which can normally
|
|
be created from yacc with the -d flag, to come out when you use a
|
|
Makefile? I.e. suppose I have scanner.l and grammar.y lex and yacc
|
|
source files, respectively, and I have object files defined in my
|
|
Makefile called scanner.o and grammar.o such that "make" follows
|
|
default rules to create these from the aforementioned source files.
|
|
|
|
If you use lex+yacc with the UNIX tool make, you can add your own explicit
|
|
dependencies or change the default suffix rules and add your own commands
|
|
there. There is no "catch all" method for this - several variations exist,
|
|
all with their specific advantages and drawbacks - but if you know make or
|
|
are willing to learn about make, you can determine the dependencies between
|
|
your files generated by lex+yacc in the same way as with any other source
|
|
files. You'll also find a framework for using lex+yacc together with
|
|
make in the book "The Unix Programming Environment" by Kernighan + Pike
|
|
[K&P]. Though only one chapter (but not the smallest) is dedicated to
|
|
lex+yacc, it is recommendable to get this book anyway - it pays, not only
|
|
for your work with lex+yacc.)
|
|
|
|
Now let's look a little closer to the problem. If you have a file grammar.y
|
|
and tell make that a file grammar.o is needed, make "knows" by its builtin
|
|
rules how to call yacc and rename and compile the resulting y.tab.c. If you
|
|
need the definitions of the %tokens and the value stack %union in a separate
|
|
file - say grammar.h - a first attempt may be to add the following explicit
|
|
rule to the Makefile:
|
|
|
|
grammar.h: grammar.y
|
|
yacc -d grammar.y
|
|
mv y.tab.h grammar.h
|
|
rm -f y.tab.c
|
|
|
|
Obviously, here is some work duplicated, as yacc will run twice now: Once
|
|
for producing grammar.h in the explicit rule, once to produce grammar.o
|
|
with the suffix rule. This is no real problem because (in general) yacc
|
|
runs very fast compared to the compilation step.
|
|
|
|
If you are really concerned about avoiding unnecessary CPU time, you
|
|
should consider some other fine point: Generally, the situation is that
|
|
changes in grammar.y will be much more reflected in y.tab.c than in
|
|
y.tab.h (respectively grammar.h in the above example). The latter will
|
|
only occur when new tokens are introduced or the type of the value stack
|
|
is changed. Typically, the latter two changes to grammar.y are much less
|
|
frequently done compared to changing some statements in the C-program
|
|
fragments that specify the actions for the rules. So it is recommendable
|
|
to extend the above further to:
|
|
|
|
grammar.h: grammar.y
|
|
yacc -d grammar.y
|
|
test -f grammar.h && cmp -s y.tab.h grammar.h ||\
|
|
mv y.tab.h grammar.h
|
|
rm -f y.tab.c
|
|
|
|
Here y.tab.h is only renamed to grammar.h if the latter doesn't exist
|
|
OR is different from the former. Leaving an existing unchanged grammar.h
|
|
alone will avoid many unnecessary recompilations of other modules which
|
|
depend on grammar.h. (If you think you have understood the last sentence,
|
|
think about it and make sure that you really did! The problem only shows
|
|
up if many *other* modules *depend* on grammar.h!)
|
|
|
|
As a slight modification of the above, you may also write a "cc"-like
|
|
shell wrapper for yacc, which generates - controlled by options - either
|
|
".c", ".o", or ".h" files from given ".y" files. Of course, this wrapper
|
|
should only conditionally rename y.tab.h, as shown above in the excerpt
|
|
from a Makefile.
|
|
|
|
{short digression on}
|
|
Some people may recommend to simply change the definition of YFLAGS in
|
|
the Makefile to `YFLAGS = -d', so that y.tab.h is produced as "side
|
|
effect" when the suffix rule gets executed. Note that this has several
|
|
drawbacks: The default name for the definitions is now y.tab.h, which is
|
|
more a work file than the name of something useful and you cannot manage
|
|
different yacc projects in the same directory without the danger of
|
|
overwriting this file by accident. Of course you can circumvent this by
|
|
changing the actions in the suffix rule and rename y.tab.h to grammar.h
|
|
there - but in this case you have already changed more than the definition
|
|
of YFLAGS.
|
|
|
|
But there's still one more problem: Make doesn't know through this how
|
|
grammar.h (or y.tab.h, if you do not rename it) can be made! All it knows
|
|
through the suffix rules is how to make grammar.o from grammar.y. So if
|
|
you have an explicit dependency elsewhere, which tells that some other target
|
|
depends on grammar.h, make will complain when it had not accidentally
|
|
decided to make grammar.o before that. (If you know make well, you may
|
|
write the dependencies in a way that y.tab.h is generated early enough,
|
|
but why worry about such details when the alternative solution shown above
|
|
also is not too complex.)
|
|
{short digression off}
|
|
3.) My grammar contains the following rule
|
|
|
|
line3 : A B C
|
|
{ yacc action sequence }
|
|
|
|
|
|
which indicates that the construct line3 is composed of the 3 tokens
|
|
A B and C, in that order.
|
|
|
|
How can I now assign the values of A, B, and C into local variables of
|
|
my choice? The problem lies in the fact that each of A B and C represent
|
|
three calls to lex, and if I pass back a pointer to yytext from lex, I
|
|
only retain the value of the last token in the sequence, in this case C,
|
|
when I get to the action sequence in my yacc code. What if I want to
|
|
be able to select the EXACT ascii tokens for each of A B and C above in
|
|
my yacc code. how do I do that?
|
|
|
|
Transferring strings from yylex to yyparse (respectively the action which
|
|
has the relevant tokens on the RHS of its grammar rule) must be done with
|
|
care: Simply using pointers to yytext is no solution here - you must copy
|
|
the contents of yytext to a safe place. For that purpose you could malloc
|
|
some space in the action of yylex (not yyparse!!) which recognizes the token,
|
|
as shown in the code excerpt following below. (Your C standard library may
|
|
also contain strdup which does malloc and strcpy all in one, but it's not
|
|
difficult to do without.) Of course you must be careful here:
|
|
|
|
- malloc may return a NULL-pointer because of memory limits;
|
|
ALWAYS CHECK FOR THIS CONDITION. (And even if you think this
|
|
will never occur on your machine since you have lots of MB-s
|
|
available - one day someone else may take your code and port
|
|
it to some other environment and the program will fail!)
|
|
|
|
- you must not forget to allocate space for the terminating
|
|
NUL-byte, hence malloc(yylen + 1) is the right thing;
|
|
|
|
- you must carefully plan for deallocation if your program
|
|
shall not run out of memory when it analyzes large input.
|
|
|
|
If you transfer pointers to the malloc-ed space via the value stack, the
|
|
last chance for free-ing the space is immediately before the value stack
|
|
is cleared. So, if you don't copy the pointers which correspond to A, B,
|
|
and C in the above example, your last chance is in the grammar action.
|
|
A short code excerpt should help to understand what is required:
|
|
|
|
strsave-source
|
|
------------------------------------------------------------------
|
|
char *strsave(s, n) char *s;
|
|
{
|
|
char *t = malloc(n + 1);
|
|
if (t == (char *)0) {
|
|
scream and shout and die horrible death
|
|
}
|
|
strcpy(t, s);
|
|
return t;
|
|
}
|
|
------------------------------------------------------------------
|
|
|
|
lex-source
|
|
------------------------------------------------------------------
|
|
........
|
|
%%
|
|
regex-for-token-A {
|
|
yylval.str = strsave(yytext, yylen);
|
|
return(A);
|
|
}
|
|
regex-for-token-B {
|
|
yylval.str = strsave(yytext, yylen);
|
|
return(B);
|
|
}
|
|
regex-for-token-C {
|
|
yylval.str = strsave(yytext, yylen);
|
|
return(C);
|
|
}
|
|
------------------------------------------------------------------
|
|
|
|
yacc-source
|
|
------------------------------------------------------------------
|
|
......
|
|
%union {
|
|
.....
|
|
char *str;
|
|
.....
|
|
}
|
|
......
|
|
%token <str> A B C
|
|
......
|
|
%%
|
|
......
|
|
line3 : A B C {
|
|
$1, $2, and $3 are now pointers to "safe" copies of
|
|
the original token strings, but if you don't copy
|
|
these pointers to variables which SURVIVE THIS BLOCK,
|
|
you must cleanup before this action ends:
|
|
free($1);
|
|
free($2);
|
|
free($3);
|
|
Be especially careful if you create multiple
|
|
references to the malloc-ed space or if you transfer
|
|
one of these further out, say: $$ = $1. In this
|
|
case you must of course NOT free $1 here! Instead
|
|
the action(s) of the rule(s) where the nonterminal
|
|
"line3" appears on the RHS must take over that
|
|
responsibility.
|
|
}
|
|
......
|
|
-------------------------------------------------------------------
|
|
|
|
***************************************************
|
|
********* WARNING: Dangerous Code ahead ***********
|
|
***************************************************
|
|
|
|
Some people successfully solve the problem in principle as follows:
|
|
|
|
yacc-source
|
|
-------------------------------------------------------------------
|
|
.......
|
|
%%
|
|
........
|
|
line3:
|
|
A { atext = strsave(yytext, yylen); }
|
|
B { btext = strsave(yytext, yylen); }
|
|
C { ctext = strsave(yytext, yylen); }
|
|
........
|
|
-------------------------------------------------------------------
|
|
|
|
This introduces an obvious minor problem and a by far less obvious major
|
|
problem:
|
|
|
|
First, if your grammar is not trivial, writing actions WITHIN the RHS of
|
|
the rules effectively changes the grammar and introduces new rules with
|
|
an empty RHS. Because of that yacc may complain about about conflicts now
|
|
and you must have a good understanding of the algorithm of yyparse, to decide
|
|
that the grammar does what you want. This is the minor problem and if you
|
|
are lucky, yacc sees no conflicts and you may think the above works.
|
|
|
|
But be warned if you don't have a thorough understanding under which
|
|
circumstances yyparse needs to have a look-ahead symbol to continue:
|
|
Never, again NEVER, again ***NEVER*** depend on an unchanged contents
|
|
of yytext in the actions of yyparse: In yyparse the calls to yylex which
|
|
in turn change the contents of yytext are slightly "asynchronous", i.e.
|
|
there might be a read ahead of one token and yytext doesn't contain what
|
|
you think! But there's not ALWAYS a read-ahead, it depends whether yyparse
|
|
needs one more token to decide what to do further. So it may occur that
|
|
your grammar works fine with the above and after a minor change you make
|
|
during further developing your project, there will be the need for look
|
|
ahead and you run in big trouble!
|
|
|
|
Concerning access to yytext (and some other yy-variables), a good rule for
|
|
professional coding is the following:
|
|
|
|
** If you combine lex+yacc generated programs, the only place where yytext
|
|
** is valid is in the action block following the regular expression in the
|
|
** ".l"-file. Never access yytext (or any other variables that belong to
|
|
** the parts of the program generated by lex) within the actions of the
|
|
** grammar rules.
|
|
4.) How do I use two yacc grammars in the same program? The problem is,
|
|
when I try to link several modules create by lex and/or yacc, the
|
|
linker complains about "name clashes".
|
|
|
|
With respect to this problem, there is some bad and some good news.
|
|
|
|
|
|
The bad news is that obviously lex+yacc were not designed with this in
|
|
mind, i.e. there are plenty of globally visible symbols which would cause
|
|
the aforementioned complaints from the linker, when you try to compile and
|
|
link several lex and/or yacc generated sources together into one program.
|
|
This is a bit of a pity, since every possible work-around is bound to be
|
|
more complicated and less comfortable as when provisions for that were
|
|
made in the design of lex+yacc. (Note that there may be modified versions
|
|
of lex+yacc locally available, which help you with the steps described
|
|
in the rest of the answer.)
|
|
|
|
The good news, however, is that lex+yacc both generate C-source, which
|
|
you may massage in any desired way before you continue to compile and
|
|
link the modules. Further, all the globally visible names (that make it
|
|
impossible to link any two unchanged modules together), start with the
|
|
unique prefix "yy". So changing this prefix from "yy" to something else
|
|
is the key to the solution you need.
|
|
|
|
Of course, you don't want to go through the lex.yy.c or yy.tab.c file and
|
|
make all these changes "per hand" each time you generated one of the files
|
|
anew, so the challenge is to find a way to apply the changes automatically.
|
|
Two possibilities immediately spring into mind: Using the UNIX tool "sed"
|
|
or using some #define-s and let the C-preprocessor hide the original names.
|
|
|
|
The first solution is straight forward if you first identify all global
|
|
yy-symbols in the program, then build your sed-script which changes them
|
|
on all occurrences in each line. Of course you can (and should) include the
|
|
call to this sed-program into the Makefile that coordinates the compilation
|
|
steps of your project.
|
|
|
|
The other possibility is to #define all the "yy"-symbols to something
|
|
unique. Again you must first identify what globally symbols are affected
|
|
by this. For a program generated by yacc you may finally come up with a
|
|
file that reads more or less as follows:
|
|
|
|
zzgrammar.c
|
|
-------------------------------------------------------------
|
|
#define yyparse zzparse
|
|
#define yystate zzstate
|
|
#define yynerr zznerr
|
|
..... etc. etc. .....
|
|
#include y.tab.c
|
|
-------------------------------------------------------------
|
|
|
|
The compiler proper now doesn't see any names beginning with "yy" but
|
|
only those beginning with "zz". (Of course you should consider choosing
|
|
a more meaningful prefix - but be careful: Not all linkers use all
|
|
characters to distinguish external names - ANSI-C for example requires
|
|
global symbols only to differ within the first six characters with no
|
|
distinction between upper and lower case, and many older UNIX linkers
|
|
have a limit of seven characters here.)
|
|
|
|
No matter which way you go to change the global names (sed or #define),
|
|
there is one further important point: If you change the names of any
|
|
functions which come from the library, you must also supply a version
|
|
of this function yourself. (Be sure to understand the calling hierarchy
|
|
of the functions - the topic is treated in another of these lex+yacc FAQs.)
|
|
|
|
Fortunately, the library functions are absolutely trivial in case of yacc.
|
|
If you supply your own main program (which is the only meaningful thing
|
|
when you link several yacc generated modules together), the only routine
|
|
which may still come from the yacc-library is yyerror. It is usually
|
|
called with the string "syntax error" and does nothing but print this
|
|
string to stderr.
|
|
|
|
The lex-library contains a little more - besides the main program (which
|
|
you surely don't need when calling yylex from yyparse), the absolutely
|
|
trivial routine yywrap, and finally two functions which are called when
|
|
you use yyless or REJECT in the actions of your rules. If you are not sure
|
|
what these functions really do, consider changing the regular expressions
|
|
and actions so that these two are never used. (Sorry, you have been told
|
|
that the work-arounds are not so comfortable as they might be if the case
|
|
of more than one yylex linked into a single program were considered in the
|
|
early design of lex+yacc.)
|
|
|
|
Martin Weitzel
|
|
EDV-Beratung
|
|
Heidelberger Str. 197
|
|
D-W-6100 Darmstadt
|
|
Germany
|
|
--
|
|
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83
|
|
|
|
|