Lex and YACC primer/HOWTO

来源:百度文库 编辑:神马文学网 时间:2024/04/26 18:49:08

Lex and YACC primer/HOWTO

PowerDNS BV (bert hubert )

v0.8 $Date: 2004/09/20 07:14:23 $
This document tries to help you get started using Lex and YACC

1. Introduction

Welcome, gentle reader.

If you have been programming for any length of time in a Unix environment,you will have encountered the mystical programs Lex & YACC, or as theyare known to GNU/Linux users worldwide, Flex & Bison, where Flex is aLex implementation by Vern Paxson and Bison the GNU version of YACC. We willcall these programs Lex and YACC throughout - the newer versions areupwardly compatible, so you can use Flex and Bison when trying our examples.

These programs are massively useful, but as with your C compiler, theirmanpage does not explain the language they understand, nor how to use them.YACC is really amazing when used in combination with Lex, however, the Bisonmanpage does not describe how to integrate Lex generated code with yourBison program.

1.1 What this document is NOT

There are several great books which deal with Lex & YACC. By all meansread these books if you need to know more. They provide far more informationthan we ever will. See the 'Further Reading' section at the end. Thisdocument is aimed at bootstrapping your use of Lex& YACC, to allow you to create your first programs.

The documentation that comes with Flex and BISON is also excellent, but notutorial. They do complement my HOWTO very well though. They too arereferenced at the end.

I am by no means a YACC/Lex expert. When I started writing this document, Ihad exactly two days of experience. All I want to accomplish is to makethose two days easier for you.

In no way expect the HOWTO to show proper YACC and Lex style. Exampleshave been kept very simple and there may be better ways to write them. Ifyou know how to, please let me know.

1.2 Downloading stuff

Please note that you can download all the examples shown, which are inmachine readable form. See thehomepage for details.

1.3 License

Copyright (c) 2001 by bert hubert. This material may bedistributed only subject to the terms and conditions set forth in the OpenPublication License, vX.Y or later (the latest version is presentlyavailable at http://www.opencontent.org/openpub/).

2. What Lex & YACC can do for you

When properly used, these programs allow you to parse complex languages withease. This is a great boon when you want to read a configuration file, orwant to write a compiler for any language you (or anyone else) might haveinvented.

With a little help, which this document will hopefully provide, you willfind that you will never write a parser again by hand - Lex & YACC arethe tools to do this.

2.1 What each program does on its own

Although these programs shine when used together, they each serve adifferent purpose. The next chapter will explain what each part does.

3. Lex

The program Lex generates a so called `Lexer'. This is a function that takesa stream of characters as its input, and whenever it sees a group ofcharacters that match a key, takes a certain action. A very simple example:

%{
#include
%}

%%
stop printf("Stop command received\n");
start printf("Start command received\n");
%%

The first section, in between the %{ and %} pair is included directly in theoutput program. We need this, because we use printf later on, which isdefined in stdio.h.

Sections are separated using '%%', so the first line of the second sectionstarts with the 'stop' key. Whenever the 'stop' key is encountered in theinput, the rest of the line (a printf() call) is executed.

Besides 'stop', we've also defined 'start', which otherwise does mostly thesame.

We terminate the code section with '%%' again.

To compile Example 1, do this:

lex example1.l
cc lex.yy.c -o example1 -ll

NOTE: If you are using flex, instead of lex, you may have to change '-ll'to '-lfl' in the compilation scripts. RedHat 6.x and SuSE need this, even whenyou invoke 'flex' as 'lex'!

This will generate the file 'example1'. If you run it, it waits for you totype some input. Whenever you type something that is not matched by any ofthe defined keys (ie, 'stop' and 'start') it's output again. If youenter 'stop' it will output 'Stop command received';

Terminate with a EOF (^D).

You may wonder how the program runs, as we didn't define a main() function.This function is defined for you in libl (liblex) which we compiled in withthe -ll command.

3.1 Regular expressions in matches

This example wasn't very useful in itself, and our next one won't be either.It will however show how to use regular expressions in Lex, which aremassively useful later on.

Example 2:

%{
#include
%}

%%
[0123456789]+ printf("NUMBER\n");
[a-zA-Z][a-zA-Z0-9]* printf("WORD\n");
%%

This Lex file describes two kinds of matches (tokens): WORDs and NUMBERs.Regular expressions can be pretty daunting but with only a little work it iseasy to understand them. Let's examine the NUMBER match:

[0123456789]+

This says: a sequence of one or more characters from the group 0123456789.We could also have written it shorter as:

[0-9]+

Now, the WORD match is somewhat more involved:

[a-zA-Z][a-zA-Z0-9]*

The first part matches 1 and only 1 character that is between 'a' and 'z',or between 'A' and 'Z'. In other words, a letter. This initial letter thenneeds to be followed by zero or more characters which are either a letter ora digit. Why use an asterisk here? The '+' signifies 1 or more matches, buta WORD might very well consist of only one character, which we've alreadymatched. So the second part may have zero matches, so we write a '*'.

This way, we've mimicked the behaviour of many programming languages whichdemand that a variable name *must* start with a letter, but can containdigits afterwards. In other words, 'temperature1' is a valid name,but '1temperature' is not.

Try compiling Example 2, lust like Example 1, and feed it some text. Here isa sample session:

$ ./example2
foo
WORD

bar
WORD

123
NUMBER

bar123
WORD

123bar
NUMBER
WORD

You may also be wondering where all this whitespace is coming from in theoutput. The reason is simple: it was in the input, and we don't match on itanywhere, so it gets output again.

The Flex manpage documents its regular expressions in detail. Many peoplefeel that the perl regular expression manpage (perlre) is also very useful,although Flex does not implement everything perl does.

Make sure that you do not create zero length matches like '[0-9]*' - yourlexer might get confused and start matching empty strings repeatedly.

3.2 A more complicated example for a C like syntax

Let's say we want to parse a file that looks like this:

logging {
category lame-servers { null; };
category cname { null; };
};

zone "." {
type hint;
file "/etc/bind/db.root";
};

We clearly see a number of categories (tokens) in this file:

  • WORDs, like 'zone' and 'type'
  • FILENAMEs, like '/etc/bind/db.root'
  • QUOTEs, like those surrounding the filename
  • OBRACEs, {
  • EBRACEs, }
  • SEMICOLONs, ;

The corresponding Lex file is Example 3:

%{
#include
%}

%%
[a-zA-Z][a-zA-Z0-9]* printf("WORD ");
[a-zA-Z0-9\/.-]+ printf("FILENAME ");
\" printf("QUOTE ");
\{ printf("OBRACE ");
\} printf("EBRACE ");
; printf("SEMICOLON ");
\n printf("\n");
[ \t]+ /* ignore whitespace */;
%%

When we feed our file to the program this Lex file generates (usingexample3.compile), we get:

WORD OBRACE 
WORD FILENAME OBRACE WORD SEMICOLON EBRACE SEMICOLON
WORD WORD OBRACE WORD SEMICOLON EBRACE SEMICOLON
EBRACE SEMICOLON

WORD QUOTE FILENAME QUOTE OBRACE
WORD WORD SEMICOLON
WORD QUOTE FILENAME QUOTE SEMICOLON
EBRACE SEMICOLON

When compared with the configuration file mentioned above, it is clear thatwe have neatly 'Tokenized' it. Each part of the configuration file has beenmatched, and converted into a token.

And this is exactly what we need to put YACC to good use.

3.3 What we've seen

We've seen that Lex is able to read arbitrary input, and determine what eachpart of the input is. This is called 'Tokenizing'.

4. YACC

YACC can parse input streams consisting of tokens with certain values. Thisclearly describes the relation YACC has with Lex, YACC has no ideawhat 'input streams' are, it needs preprocessed tokens. While you can write yourown Tokenizer, we will leave that entirely up to Lex.

A note on grammars and parsers. When YACC saw the light of day, the tool wasused to parse input files for compilers: programs. Programs written in aprogramming language for computers are typically *not* ambiguous - they havejust one meaning. As such, YACC does not cope with ambiguity and willcomplain about shift/reduce or reduce/reduce conflicts. More aboutambiguity and YACC "problems" can be found in 'Conflicts' chapter.

4.1 A simple thermostat controller

Let's say we have a thermostat that we want to control using a simplelanguage. A session with the thermostat may look like this:

heat on
Heater on!
heat off
Heater off!
target temperature 22
New temperature set!

The tokens we need to recognize are: heat, on/off (STATE), target, temperature,NUMBER.

The Lex tokenizer (Example 4) is:

%{
#include
#include "y.tab.h"
%}
%%
[0-9]+ return NUMBER;
heat return TOKHEAT;
on|off return STATE;
target return TOKTARGET;
temperature return TOKTEMPERATURE;
\n /* ignore end of line */;
[ \t]+ /* ignore whitespace */;
%%

We note two important changes. First, we include the file 'y.tab.h', andsecondly, we no longer print stuff, we return names of tokens. This changeis because we are now feeding it all to YACC, which isn't interested inwhat we output to the screen. Y.tab.h has definitions for these tokens.

But where does y.tab.h come from? It is generated by YACC from the GrammarFile we are about to create. As our language is very basic, so is the grammar:

commands: /* empty */
| commands command
;

command:
heat_switch
|
target_set
;

heat_switch:
TOKHEAT STATE
{
printf("\tHeat turned on or off\n");
}
;

target_set:
TOKTARGET TOKTEMPERATURE NUMBER
{
printf("\tTemperature set\n");
}
;

The first part is what I call the 'root'. It tells us that wehave 'commands', and that these commands consist of individual 'command'parts. As you can see this rule is very recursive, because it againcontains the word 'commands'. What this means is that the program is nowcapable of reducing a series of commands one by one. Read the chapter 'Howdo Lex and YACC work internally' for important details on recursion.

The second rule defines what a command is. We support only two kinds ofcommands, the 'heat_switch' and the 'target_set'. This is what the |-symbolsignifies - 'a command consists of either a heat_switch or a target_set'.

A heat_switch consists of the HEAT token, which is simply the word 'heat',followed by a state (which we defined in the Lex file as 'on' or 'off').

Somewhat more complicated is the target_set, which consists of the TARGETtoken (the word 'target'), the TEMPERATURE token (the word 'temperature')and a number.

A complete YACC file

The previous section only showed the grammar part of the YACC file, butthere is more. This is the header that we omitted:

%{
#include
#include

void yyerror(const char *str)
{
fprintf(stderr,"error: %s\n",str);
}

int yywrap()
{
return 1;
}

main()
{
yyparse();
}

%}

%token NUMBER TOKHEAT STATE TOKTARGET TOKTEMPERATURE
The yyerror() function is called by YACC if it finds an error. We simplyoutput the message passed, but there are smarter things to do. Seethe 'Further reading' section at the end.

The function yywrap() can be used to continue reading from another file. Itis called at EOF and you can than open another file, and return 0. Or youcan return 1, indicating that this is truly the end. For more about this,see the 'How do Lex and YACC work internally' chapter.

Then there is the main() function, that does nothing but set everything inmotion.

The last line simply defines the tokens we will be using. These are outputusing y.tab.h if YACC is invoked with the '-d' option.

Compiling & running the thermostat controller

lex example4.l
yacc -d example4.y
cc lex.yy.c y.tab.c -o example4
A few things have changed. We now also invoke YACC to compile our grammar,which creates y.tab.c and y.tab.h. We then call Lex as usual. Whencompiling, we remove the -ll flag: we now have our own main() function anddon't need the one provided by libl.

NOTE: if you get an error about your compiler not being able tofind 'yylval', add this to example4.l, just beneath #include:
extern YYSTYPE yylval;
This is explained in the 'How Lex and YACC work internally' section.

A sample session:

$ ./example4 
heat on
Heat turned on or off
heat off
Heat turned on or off
target temperature 10
Temperature set
target humidity 20
error: parse error
$

This is not quite what we set out to achieve, but in the interest of keepingthe learning curve manageable, not all cool stuff can be presented at once.

4.2 Expanding the thermostat to handle parameters

As we've seen, we now parse the thermostat commands correctly, and even flagmistakes properly. But as you might have guessed by the weasely wording, theprogram has no idea of what it should do, it does not get passed any of thevalues you enter.

Let's start by adding the ability to read the new target temperature. Inorder to do so, we need to learn the NUMBER match in the Lexer to convertitself into an integer value, which can then be read in YACC.

Whenever Lex matches a target, it puts the text of the match in thecharacter string 'yytext'. YACC in turn expects to find a value in thevariable 'yylval'. In Example 5, we see the obvious solution:

%{
#include
#include "y.tab.h"
%}
%%
[0-9]+ yylval=atoi(yytext); return NUMBER;
heat return TOKHEAT;
on|off yylval=!strcmp(yytext,"on"); return STATE;
target return TOKTARGET;
temperature return TOKTEMPERATURE;
\n /* ignore end of line */;
[ \t]+ /* ignore whitespace */;
%%

As you can see, we run atoi() on yytext, and put the result in yylval, whereYACC can see it. We do much the same for the STATE match, where we compareit to 'on', and set yylval to 1 if it is equal. Please note that having aseparate 'on' and 'off' match in Lex would produce faster code, but I wantedto show a more complicated rule and action for a change.

Now we need to learn YACC how to deal with this. What is called 'yylval' inLex has a different name in YACC. Let's examine the rule setting the newtemperature target:

target_set: 
TOKTARGET TOKTEMPERATURE NUMBER
{
printf("\tTemperature set to %d\n",$3);
}
;

To access the value of the third part of the rule (ie, NUMBER), we need touse $3. Whenever yylex() returns, the contents of yylval are attached to theterminal, the value of which can be accessed with the $-construct.

To expound on this further, let's observe the new 'heat_switch' rule:

heat_switch:
TOKHEAT STATE
{
if($2)
printf("\tHeat turned on\n");
else
printf("\tHeat turned off\n");
}
;

If you now run example5, it properly outputs what you entered.

4.3 Parsing a configuration file

Let's repeat part of the configuration file we mentioned earlier:

zone "." {
type hint;
file "/etc/bind/db.root";
};

Remember that we already wrote a Lexer for this file. Now all we need to dois write the YACC grammar, and modify the Lexer so it returns values ina format YACC can understand.

In the lexer from Example 6 we see:

%{
#include
#include "y.tab.h"
%}

%%

zone return ZONETOK;
file return FILETOK;
[a-zA-Z][a-zA-Z0-9]* yylval=strdup(yytext); return WORD;
[a-zA-Z0-9\/.-]+ yylval=strdup(yytext); return FILENAME;
\" return QUOTE;
\{ return OBRACE;
\} return EBRACE;
; return SEMICOLON;
\n /* ignore EOL */;
[ \t]+ /* ignore whitespace */;
%%

If you look carefully, you can see that yylval has changed! We no longerexpect it to be an integer, but in fact assume that it is a char *. In theinterest of keeping things simple, we invoke strdup and waste a lot ofmemory. Please note that this may not be a problem in many areas where youonly need to parse a file once, and then exit.

We want to store character strings because we are now mostly dealing withnames: file names and zone names. In a later chapter we will explain how todeal with multiple types of data.

In order to tell YACC about the new type of yylval, we add this line to theheader of our YACC grammar:

#define YYSTYPE char *

The grammar itself is again more complicated. We chop it in parts to make iteasier to digest.

commands:
|
commands command SEMICOLON
;


command:
zone_set
;

zone_set:
ZONETOK quotedname zonecontent
{
printf("Complete zone for '%s' found\n",$2);
}
;

This is the intro, including the aforementioned recursive 'root'. Pleasenote that we specify that commands are terminated (and separated) by ;'s. Wedefine one kind of command, the 'zone_set'. It consists of the ZONE token(the word 'zone'), followed by a quoted name and the 'zonecontent'. Thiszonecontent starts out simple enough:

zonecontent:
OBRACE zonestatements EBRACE

It needs to start with an OBRACE, a {. Then follow the zonestatements,followed by an EBRACE, }.

quotedname:
QUOTE FILENAME QUOTE
{
$$=$2;
}

This section defines what a 'quotedname' is: a FILENAME between QUOTEs.Then it says something special: the value of a quotedname token is the valueof the FILENAME. This means that the quotedname has as its value thefilename without quotes.

This is what the magic '$$=$2;' command does. It says: my value is the valueof my second part. When the quotedname is now referenced in other rules, andyou access its value with the $-construct, you see the value that we sethere with $$=$2.

NOTE: this grammar chokes on filenames without either a '.' or a '/' inthem.

zonestatements:
|
zonestatements zonestatement SEMICOLON
;

zonestatement:
statements
|
FILETOK quotedname
{
printf("A zonefile name '%s' was encountered\n", $2);
}
;

This is a generic statement that catches all kinds of statements withinthe 'zone' block. We again see the recursiveness.

block: 
OBRACE zonestatements EBRACE SEMICOLON
;

statements:
| statements statement
;

statement: WORD | block | quotedname

This defines a block, and 'statements' which may be found within.

When executed, the output is like this:

$ ./example6
zone "." {
type hint;
file "/etc/bind/db.root";
type hint;
};
A zonefile name '/etc/bind/db.root' was encountered
Complete zone for '.' found

5. Making a Parser in C++

Although Lex and YACC predate C++, it is possible to generate a C++ parser.While Flex includes an option to generate a C++ lexer, we won't be usingthat, as YACC doesn't know how to deal with it directly.

My preferred way to make a C++ parser is to have Lex generate a plain Cfile, and to let YACC generate C++ code. When you then link yourapplication, you may run into some problems because the C++ code by defaultwon't be able to find C functions, unless you've told it that thosefunctions are extern "C".

To do so, make a C header in YACC like this:

extern "C"
{
int yyparse(void);
int yylex(void);
int yywrap()
{
return 1;
}

}

If you want to declare or change yydebug, you must now do it like this:

extern int yydebug;

main()
{
yydebug=1;
yyparse();
}

This is because C++'s One Definition Rule, which disallows multipledefinitions of yydebug.

You may also find that you need to repeat the #define of YYSTYPE in your Lexfile, because of C++'s stricter type checking.

To compile, do something like this:

lex bindconfig2.l
yacc --verbose --debug -d bindconfig2.y -o bindconfig2.cc
cc -c lex.yy.c -o lex.yy.o
c++ lex.yy.o bindconfig2.cc -o bindconfig2

Because of the -o statement, y.tab.h is now called bindconfig2.cc.h, so takethat into account.

To summarize: don't bother to compile your Lexer in C++, keep it in C. Makeyour Parser in C++ and explain your compiler that some functions are Cfunctions with extern "C" statements.

6. How do Lex and YACC work internally

In the YACC file, you write your own main() function, which calls yyparse()at one point. The function yyparse() is created for you by YACC, and ends upin y.tab.c.

yyparse() reads a stream of token/value pairs from yylex(), which needs tobe supplied. You can code this function yourself, or have Lex do it for you.In our examples, we've chosen to leave this task to Lex.

The yylex() as written by Lex reads characters from a FILE * file pointercalled yyin. If you do not set yyin, it defaults to standard input. Itoutputs to yyout, which if unset defaults to stdout. You can also modifyyyin in the yywrap() function which is called at the end of a file. Itallows you to open another file, and continue parsing.

If this is the case, have it return 0. If you want to end parsing at thisfile, let it return 1.

Each call to yylex() returns an integer value which represents a token type.This tells YACC what kind of token it has read. The token may optionallyhave a value, which should be placed in the variable yylval.

By default yylval is of type int, but you can override that from the YACCfile by re#defining YYSTYPE.

The Lexer needs to be able to access yylval. In order to do so, it must bedeclared in the scope of the lexer as an extern variable. The original YACCneglects to do this for you, so you should add the following to your lexter,just beneath #include :

extern YYSTYPE yylval;

Bison, which most people are using these days, does this for youautomatically.

6.1 Token values

As mentioned before, yylex() needs to return what kind of token itencountered, and put its value in yylval. When these tokens are defined withthe %token command, they are assigned numerical id's, starting from 256.

Because of that fact, it is possible to have all ascii characters as atoken. Let's say you are writing a calculator, up till now we would havewritten the lexer like this:

[0-9]+          yylval=atoi(yytext); return NUMBER;
[ \n]+ /* eat whitespace */;
- return MINUS;
\* return MULT;
\+ return PLUS;
...

Our YACC grammer would then contain:

        exp:    NUMBER 
|
exp PLUS exp
|
exp MINUS exp
|
exp MULT exp

This is needlessly complicated. By using characters as shorthands fornumerical token id's, we can rewrite our lexer like this:

[0-9]+          yylval=atoi(yytext); return NUMBER;
[ \n]+ /* eat whitespace */;
. return (int) yytext[0];

This last dot matches all single otherwise unmatched characters.

Our YACC grammer would then be:

        exp:    NUMBER 
|
exp '+' exp
|
exp '-' exp
|
exp '*' exp

This is lots shorter and also more obvious. You do not need to declare theseascii tokens with %token in the header, they work out of the box.

One other very good thing about this construct is that Lex will now matcheverything we throw at it - avoiding the default behaviour of echoingunmatched input to standard output. If a user of this calculator uses a ^,for example, it will now generate a parsing error, instead of being echoedto standard output.

6.2 Recursion: 'right is wrong'

Recursion is a vital aspect of YACC. Without it, you can't specify that afile consists of a sequence of independent commands or statements. Out ofits own accord, YACC is only interested in the first rule, or the one youdesignate as the starting rule, with the '%start' symbol.

Recursion in YACC comes in two flavours: right and left. Left recursion,which is the one you should use most of the time, looks like this:

commands: /* empty */
|
commands command
This says: a command is either empty, or it consists of more commands,followed by a command. They way YACC works means that it can now easily chopoff individual command groups (from the front) and reduce them.

Compare this to right recursion, which confusingly enough looks better tomany eyes:

commands: /* empty */
|
command commands
But this is expensive. If used as the %start rule, it requires YACC to keepall commands in your file on the stack, which may take a lot of memory. Soby all means, use left recursion when parsing long statements, like entirefiles. Sometimes it is hard to avoid right recursion but if your statementsare not too long, you do not need to go out of your way to use leftrecursion.

If you have something terminating (and therefore separating) your commands,right recursion looks very natural, but is still expensive:

commands: /* empty */
|
command SEMICOLON commands

The right way to code this is using left recursion (I didn't invent thiseither):

commands: /* empty */
|
commands command SEMICOLON

Earlier versions of this HOWTO mistakenly used right recursion. MarkusTriska kindly informed us of this.

6.3 Advanced yylval: %union

Currently, we need to define *the* type of yylval. This however is notalways appropriate. There will be times when we need to be able to handlemultiple data types. Returning to our hypothetical thermostat, perhaps wewant to be able to choose a heater to control, like this:

heater mainbuiling
Selected 'mainbuilding' heater
target temperature 23
'mainbuilding' heater target temperature now 23

What this calls for is for yylval to be a union, which can hold both stringsand integers - but not simultaneously.

Remember that we told YACC previously what type yylval was supposed to by bydefining YYSTYPE. We could conceivably define YYSTYPE to be a union thisway, by YACC has an easier method for doing this: the %union statement.

Based on Example 4, we now write the Example 7 YACC grammar. First theintro:

%token TOKHEATER TOKHEAT TOKTARGET TOKTEMPERATURE

%union
{
int number;
char *string;
}

%token STATE
%token NUMBER
%token WORD

We define our union, which contains only a number and a string. Then usingan extended %token syntax, we explain to YACC which part of the union eachtoken should access.

In this case, we let the STATE token use an integer, as before. Same goesfor the NUMBER token, which we use for reading temperatures.

New however is the WORD token, which is declared to need a string.

The Lexer file changes a bit too:

%{
#include
#include
#include "y.tab.h"
%}
%%
[0-9]+ yylval.number=atoi(yytext); return NUMBER;
heater return TOKHEATER;
heat return TOKHEAT;
on|off yylval.number=!strcmp(yytext,"on"); return STATE;
target return TOKTARGET;
temperature return TOKTEMPERATURE;
[a-z0-9]+ yylval.string=strdup(yytext);return WORD;
\n /* ignore end of line */;
[ \t]+ /* ignore whitespace */;
%%

As you can see, we don't access the yylval directly anymore, we add a suffixindicating which part we want to access. We don't need to do that in theYACC grammar however, as YACC performs the magic for us:

heater_select:
TOKHEATER WORD
{
printf("\tSelected heater '%s'\n",$2);
heater=$2;
}
;

Because of the %token declaration above, YACC automatically picksthe 'string' member from our union. Note also that we store a copy of $2,which is later used to tell the user which heater he is sending commands to:

target_set:
TOKTARGET TOKTEMPERATURE NUMBER
{
printf("\tHeater '%s' temperature set to %d\n",heater,$3);
}
;

For more details, read example7.y.

7. Debugging

Especially when learning, it is important to have debugging facilities.Luckily, YACC can give a lot of feedback. This feedback comes at the cost ofsome overhead, so you need to supply some switches to enable it.

When compiling your grammar, add --debug and --verbose to the YACCcommandline. In your grammar C heading, add the following:

int yydebug=1;

This will generate the file 'y.output' which explains the state machine thatwas created.

When you now run the generated binary, it will output a *lot* of what ishappening. This includes what state the state machine currently has, andwhat tokens are being read.

Peter Jinks wrote a page ondebugging whichcontains some common errors and how to solve them.

7.1 The state machine

Internally, your YACC parser runs a so called 'state machine'. As the nameimplies, this is a machine that can be in several states. Then there arerules which govern transitions from one state to another. Everything startswith the so called 'root' rule I mentioned earlier.

To quote from the output from the Example 7 y.output:

state 0

ZONETOK , and go to state 1

$default reduce using rule 1 (commands)

commands go to state 29
command go to state 2
zone_set go to state 3

By default, this state reduces using the 'commands' rule. This is theaforementioned recursive rule that defines 'commands' to be built up fromindividual command statements, followed by a semicolon, followed by possiblymore commands.

This state reduces until it hits something it understands, in this case, aZONETOK, ie, the word 'zone'. It then goes to state 1, which deals furtherwith a zone command:

state 1

zone_set -> ZONETOK . quotedname zonecontent (rule 4)

QUOTE , and go to state 4

quotedname go to state 5

The first line has a '.' in it to indicate where we are: we've just seen aZONETOK and are now looking for a 'quotedname'. Apparently, a quotednamestarts with a QUOTE, which sends us to state 4.

To follow this further, compile Example 7 with the flags mentioned in theDebugging section.

7.2 Conflicts: 'shift/reduce', 'reduce/reduce'

Whenever YACC warns you about conflicts, you may be in for trouble. Solvingthese conflicts appears to be somewhat of an art form that may teach you alot about your language. More than you possibly would have wanted to know.

The problems revolve around how to interpret a sequence of tokens. Let'ssuppose we define a language that needs to accept both these commands:

        delete heater all
delete heater number1

To do this, we define this grammar:

        delete_heaters:
TOKDELETE TOKHEATER mode
{
deleteheaters($3);
}

mode: WORD

delete_a_heater:
TOKDELETE TOKHEATER WORD
{
delete($3);
}

You may already be smelling trouble. The state machine starts by reading theword 'delete', and then needs to decide where to go based on the next token.This next token can either be a mode, specifying how to delete the heaters,or the name of a heater to delete.

The problem however is that for both commands, the next token is going to bea WORD. YACC has therefore no idea what to do. This leads toa 'reduce/reduce' warning, and a further warning that the 'delete_a_heater'node is never going to be reached.

In this case the conflict is resolved easily (ie, by renaming the firstcommand to 'delete heaters all', or by making 'all' a separate token), butsometimes it is harder. The y.output file generated when you pass yacc the--verbose flag can be of tremendous help.

8. Further reading

GNU YACC (Bison) comes with a very nice info-file (.info) which documentsthe YACC syntax very well. It mentions Lex only once, but otherwise it'svery good. You can read .info files with Emacs or with the very nicetool 'pinfo'. It is also available on the GNU site:BISON Manual.

Flex comes with a good manpage which is very useful if you alreadyhave a rough understanding of what Flex does. TheFlex Manual is alsoavailable online.

After this introduction to Lex and YACC, you may find that you need moreinformation. I haven't read any of these books yet, but they sound good:

Bison-The Yacc-Compatible Parser Generator

By Charles Donnelly and Richard Stallman. AnAmazonuser found it useful.

Lex & Yacc

By John R. Levine, Tony Mason and Doug Brown. Considered to be the standardwork on this subject, although a bit dated. Reviews over atAmazon.

Compilers : Principles, Techniques, and Tools

By Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman. The 'Dragon Book'. From1985 and they just keep printing it. Considered the standard work onconstructing compilers.Amazon

Thomas Niemann wrote a document discussing how to write compilers andcalculators with Lex & YACC. You can find ithere.

The moderated usenet newsgroup comp.compilers can also be very useful butplease keep in mind that the people there are not a dedicated parserhelpdesk! Before posting, read their interestingpage and especially theFAQ.

Lex - A Lexical Analyzer Generator by M. E. Lesk and E. Schmidt is one ofthe original reference papers. It can be foundhere.

Yacc: Yet Another Compiler-Compiler by Stephen C. Johnson is one of theoriginal reference papers for YACC. It can be foundhere.It contains useful hints on style.

9. Acknowledgements & Thanks

  • Pete Jinks
  • Chris Lattner
  • John W. Millaway
  • Martin Neitzel
  • Sumit Pandaya
  • Esmond Pitt
  • Eric S. Raymond
  • Bob Schmertz
  • Adam Sulmicki
  • Markus Triska
  • Erik Verbruggen
  • Gary V. Vaughan (read his awesome Autobook)
  • Ivo van der Wijk ( Amaze Internet)