Lex - A Lexical Analyzer Generator

M. E. Lesk and E. Schmidt


ABSTRACT

Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine.

Lex source is a table of regular expressions and corresponding program fragments. The table is translated to a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expressions. As each such string is recognized the corresponding program fragment is executed. The recognition of the expressions is performed by a deterministic finite automaton generated by Lex. The program fragments written by the user are executed in the order in which the corresponding regular expressions occur in the input stream.

The lexical analysis programs written with Lex accept ambiguous specifications and choose the longest match possible at each input point. If necessary, substantial lookahead is performed on the input, but the input stream will be backed up to the end of the current partition, so that the user has general freedom to manipulate it.

Lex can generate analyzers in either C or Ratfor, a language which can be translated automatically to portable Fortran. It is available on the PDP-11 UNIX, Honeywell GCOS, and IBM OS systems. This manual, however, will only discuss generating analyzers in C on the UNIX system, which is the only supported form of Lex under UNIX Version 7. Lex is designed to simplify interfacing with Yacc, for those with access to this compiler-compiler system.

1. Introduction.

Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem-oriented specification for character string matching, and produces a program in a general purpose language which recognizes regular expressions. The regular expressions are specified by the user in the source specifications given to Lex. The Lex written code recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions. At the boundaries between strings program sections provided by the user are executed. The Lex source file associates the regular expressions and the program fragments. As each expression appears in the input to the program written by Lex, the corresponding fragment is executed.

The user supplies the additional code beyond expression matching needed to complete his tasks, possibly including code written by other generators. The program that recognizes the expressions is generated in the general purpose programming language employed for the user's program fragments. Thus, a high level expression language is provided to write the string expressions to be matched while the user's freedom to write actions is unimpaired. This avoids forcing the user who wishes to use a string manipulation language for input analysis to write processing programs in the same and often inappropriate string handling language.

Lex is not a complete language, but rather a generator representing a new language feature which can be added to different programming languages, called ``host languages.'' Just as general purpose languages can produce code to run on different computer hardware, Lex can write code in different host languages. The host language is used for the output code generated by Lex and also for the program fragments added by the user. Compatible run-time libraries for the different host languages are also provided. This makes Lex adaptable to different environments and different users. Each application may be directed to the combination of hardware and host language appropriate to the task, the user's background, and the properties of local implementations. At present, the only supported host language is C, although Fortran (in the form of Ratfor [2]) has been available in the past. Lex itself exists on UNIX, GCOS, and OS/370; but the code generated by Lex may be taken anywhere the appropriate compilers exist.

Lex turns the user's expressions and actions (called source in this memo) into the host general-purpose language; the generated program is named yylex. The yylex program will recognize expressions in a stream (called input in this memo) and perform the specified actions for each expression as it is detected. See Figure 1.

                    +-------+
          Source -> |  Lex  | -> yylex
                    +-------+

                    +-------+
          Input ->  | yylex | -> Output
                    +-------+

                An overview of Lex
                     Figure 1

For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of lines.
          %%
          [ \t]+$   ;
is all that is required. The program contains a %% delimiter to mark the beginning of the rules, and one rule. This rule contains a regular expression which matches one or more instances of the characters blank or tab (written \t for visibility, in accordance with the C language convention) just prior to the end of a line. The brackets indicate the character class made of blank and tab; the + indicates ``one or more ...''; and the $ indicates ``end of line,'' as in QED. No action is specified, so the program generated by Lex (yylex) will ignore these characters. Everything else will be copied. To change any remaining string of blanks or tabs to a single blank, add another rule:
          %%
          [ \t]+$   ;
          [ \t]+    printf(" ");
The finite automaton generated for this source will scan for both rules at once, observing at the termination of the string of blanks or tabs whether or not there is a newline character, and executing the desired rule action. The first rule matches all strings of blanks or tabs at the end of lines, and the second rule all remaining strings of blanks or tabs.
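
To make the effect of these two rules concrete, the following is a minimal hand-written C sketch of the same transformation. It is not the code Lex generates; the function name squeeze_blanks and its buffer interface are assumptions made only for this illustration. Like the $ operator, it treats a run of blanks or tabs as trailing only when a newline immediately follows.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative equivalent of the two rules
 *     [ \t]+$   ;              drop blanks/tabs at the end of a line
 *     [ \t]+    printf(" ");   squeeze any other run to one blank
 * The result is written to out, which the caller must size to at
 * least strlen(in) + 1 bytes. Returns the number of characters written. */
size_t squeeze_blanks(const char *in, char *out)
{
    size_t n = 0;

    while (*in != '\0') {
        size_t run = strspn(in, " \t");   /* length of blank/tab run here */

        if (run > 0) {
            in += run;                    /* consume the whole run first */
            if (*in != '\n')              /* not at end of line: keep one blank */
                out[n++] = ' ';
        } else {
            out[n++] = *in++;             /* default action: copy the character */
        }
    }
    out[n] = '\0';
    return n;
}
```

The whole run is consumed before the decision is made, which mirrors the automaton's observing whether a newline follows the run before choosing a rule.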

Lex can be used alone for simple transformations, or for analysis and statistics gathering on a lexical level. Lex can also be used with a parser generator to perform the lexical analysis phase; it is particularly easy to interface Lex and Yacc [3]. Lex programs recognize only regular expressions; Yacc writes parsers that accept a large class of context free grammars, but require a lower level analyzer to recognize input tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a preprocessor for a later parser generator, Lex is used to partition the input stream, and the parser generator assigns structure to the resulting pieces. The flow of control in such a case (which might be the first half of a compiler, for example) is shown in Figure 2. Additional programs, written by other generators or by hand, can be added easily to programs written by Lex.

           lexical        grammar
            rules          rules
              |              |
              v              v
         +---------+    +---------+
         |   Lex   |    |  Yacc   |
         +---------+    +---------+
              |              |
              v              v
         +---------+    +---------+
Input -> |  yylex  | -> | yyparse | -> Parsed input
         +---------+    +---------+

                Lex with Yacc
                  Figure 2

Yacc users will realize that the name yylex is what Yacc expects its lexical analyzer to be named, so that the use of this name by Lex simplifies interfacing.

Lex generates a deterministic finite automaton from the regular expressions in the source [4]. The automaton is interpreted, rather than compiled, in order to save space. The result is still a fast analyzer. In particular, the time taken by a Lex program to recognize and partition an input stream is proportional to the length of the input. The number of Lex rules or the complexity of the rules is not important in determining speed, unless rules which include forward context require a significant amount of rescanning. What does increase with the number and complexity of rules is the size of the finite automaton, and therefore the size of the program generated by Lex.

In the program written by Lex, the user's fragments (representing the actions to be performed as each regular expression is found) are gathered as cases of a switch. The automaton interpreter directs the control flow. Opportunity is provided for the user to insert either declarations or additional statements in the routine containing the actions, or to add subroutines outside this action routine.

Lex is not limited to source which can be interpreted on the basis of one character lookahead. For example, if there are two rules, one looking for ab and another for abcdefg, and the input stream is abcdefh, Lex will recognize ab and leave the input pointer just before ``cd...''. Such backup is more costly than the processing of simpler languages.
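
The cost of such backup can be seen in a small C sketch that scans literal rules by brute force. This is only an illustration under stated assumptions: Lex itself uses a table-driven automaton, and the names scan_literals and scan_result are hypothetical. The sketch reports both the length of the match and how many input characters had to be examined to find it.

```c
#include <stddef.h>
#include <string.h>

struct scan_result {
    size_t matched;    /* length of the longest rule matched (0 if none) */
    size_t examined;   /* number of input characters that were looked at */
};

/* Advance through the input while at least one literal rule is still
 * viable, remembering the last position at which some rule ended. */
struct scan_result scan_literals(const char *input,
                                 const char *const rules[], size_t nrules)
{
    struct scan_result r = {0, 0};
    size_t pos = 0;

    for (;;) {
        int viable = 0;

        for (size_t i = 0; i < nrules; i++) {
            size_t len = strlen(rules[i]);

            if (len < pos || strncmp(input, rules[i], pos) != 0)
                continue;                 /* rule already ruled out */
            if (len == pos) {
                r.matched = pos;          /* a rule ends here: remember it */
            } else if (input[pos] != '\0') {
                r.examined = pos + 1;     /* must look at the next character */
                if (rules[i][pos] == input[pos])
                    viable = 1;
            }
        }
        if (!viable)
            break;
        pos++;
    }
    return r;
}
```

With the rules ab and abcdefg and the input abcdefh, the sketch reports a match of 2 characters after examining 7, so five characters must be returned to the input.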

2. Lex Source.

The general format of Lex source is:
          {definitions}
          %%
          {rules}
          %%
          {user subroutines}
where the definitions and the user subroutines are often omitted. The second %% is optional, but the first is required to mark the beginning of the rules. The absolute minimum Lex program is thus
          %%
(no definitions, no rules) which translates into a program which copies the input to the output unchanged.

In the outline of Lex programs shown above, the rules represent the user's control decisions; they are a table, in which the left column contains regular expressions (see section 3) and the right column contains actions, program fragments to be executed when the expressions are recognized. Thus an individual rule might appear

          integer   printf("found keyword INT");
to look for the string integer in the input stream and print the message ``found keyword INT'' whenever it appears. In this example the host procedural language is C and the C library function printf is used to print the string. The end of the expression is indicated by the first blank or tab character. If the action is merely a single C expression, it can just be given on the right side of the line; if it is compound, or takes more than a line, it should be enclosed in braces. As a slightly more useful example, suppose it is desired to change a number of words from British to American spelling. Lex rules such as
          colour      printf("color");
          mechanise   printf("mechanize");
          petrol      printf("gas");
would be a start. These rules are not quite enough, since the word petroleum would become gaseum; a way of dealing with this will be described later.

3. Lex Regular Expressions.

The definitions of regular expressions are very similar to those in QED [5]. A regular expression specifies a set of strings to be matched. It contains text characters (which match the corresponding characters in the strings being compared) and operator characters (which specify repetitions, choices, and other features). The letters of the alphabet and the digits are always text characters; thus the regular expression
          integer
matches the string integer wherever it appears and the expression
          a57D
looks for the string a57D.

Operators. The operator characters are

          " \ [ ] ^ - ? . * + | ( ) $ / { } % < >
and if they are to be used as text characters, an escape should be used. The quotation mark operator (") indicates that whatever is contained between a pair of quotes is to be taken as text characters. Thus
          xyz"++"
matches the string xyz++ when it appears. Note that a part of a string may be quoted. It is harmless but unnecessary to quote an ordinary text character; the expression
          "xyz++"
is the same as the one above. Thus by quoting every non-alphanumeric character being used as a text character, the user can avoid remembering the list above of current operator characters, and is safe should further extensions to Lex lengthen the list.

An operator character may also be turned into a text character by preceding it with \ as in

          xyz\+\+
which is another, less readable, equivalent of the above expressions. Another use of the quoting mechanism is to get a blank into an expression; normally, as explained above, blanks or tabs end a rule. Any blank character not contained within [] (see below) must be quoted. Several normal C escapes with \ are recognized: \n is newline, \t is tab, and \b is backspace. To enter \ itself, use \\. Since newline is illegal in an expression, \n must be used; it is not required to escape tab and backspace. Every character but blank, tab, newline and the list above is always a text character.

Character classes. Classes of characters can be specified using the operator pair []. The construction [abc] matches a single character, which may be a, b, or c. Within square brackets, most operator meanings are ignored. Only three characters are special: these are \ - and ^. The - character indicates ranges. For example,

          [a-z0-9<>_]
indicates the character class containing all the lower case letters, the digits, the angle brackets, and underline. Ranges may be given in either order. Using - between any pair of characters which are not both upper case letters, both lower case letters, or both digits is implementation dependent and will get a warning message. (E.g., [0-z] in ASCII is many more characters than it is in EBCDIC). If it is desired to include the character - in a character class, it should be first or last; thus
          [-+0-9]
matches all the digits and the two signs.

In character classes, the ^ operator must appear as the first character after the left bracket; it indicates that the resulting string is to be complemented with respect to the computer character set. Thus

          [^abc]
matches all characters except a, b, or c, including all special or control characters; or
          [^a-zA-Z]
is any character which is not a letter. The \ character provides the usual escapes within character class brackets.

Arbitrary character. To match almost any character, the operator character . is the class of all characters except newline. Escaping into octal is possible although non-portable:

          [\40-\176]
matches all printable characters in the ASCII character set, from octal 40 (blank) to octal 176 (tilde).

Optional expressions. The operator ? indicates an optional element of an expression. Thus

          ab?c
matches either ac or abc.

Repeated expressions. Repetitions of classes are indicated by the operators * and +.

          a*
is any number of consecutive a characters, including zero; while
          a+
is one or more instances of a. For example,
          [a-z]+
is all strings of lower case letters. And
          [A-Za-z][A-Za-z0-9]*
indicates all alphanumeric strings with a leading alphabetic character. This is a typical expression for recognizing identifiers in computer languages.

Alternation and Grouping. The operator | indicates alternation:

          (ab|cd)
matches either ab or cd. Note that parentheses are used for grouping, although they are not necessary on the outside level;
          ab|cd
would have sufficed. Parentheses can be used for more complex expressions:
          (ab|cd+)?(ef)*
matches such strings as abefef, efefef, cdef, or cddd; but not abc, abcd, or abcdef.
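
The strings listed for this last expression can be checked with a small hand-coded recognizer. The following C sketch is an illustration only: the function name matches_example is hypothetical, and it tests whether an entire string matches (ab|cd+)?(ef)*, whereas a Lex rule matches a prefix of the remaining input.

```c
#include <string.h>

/* Returns 1 if the whole string s matches (ab|cd+)?(ef)*, else 0. */
int matches_example(const char *s)
{
    /* Optional group: either "ab", or "c" followed by one or more "d". */
    if (strncmp(s, "ab", 2) == 0) {
        s += 2;
    } else if (strncmp(s, "cd", 2) == 0) {
        s += 2;
        while (*s == 'd')
            s++;
    }

    /* Zero or more repetitions of "ef". */
    while (strncmp(s, "ef", 2) == 0)
        s += 2;

    /* The expression matches only if nothing is left over. */
    return *s == '\0';
}
```

No backtracking is needed here because no alternative of the optional group can be confused with the start of ef. The string abcdef fails since the optional group may be taken only once, leaving cdef unmatched.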

Context sensitivity. Lex will recognize a small amount of surrounding context. The two simplest operators for this are ^ and $. If the first character of an expression is ^, the expression will only be matched at the beginning of a line (after a newline character, or at the beginning of the input stream). This can never conflict with the other meaning of ^, complementation of character classes, since that only applies within the [] operators. If the very last character is $, the expression will only be matched at the end of a line (when immediately followed by newline). The latter operator is a special case of the / operator character, which indicates trailing context. The expression

          ab/cd
matches the string ab, but only if followed by cd. Thus
          ab$
is the same as
          ab/\n
Left context is handled in Lex by start conditions as explained in section 10. If a rule is only to be executed when the Lex automaton interpreter is in start condition x, the rule should be prefixed by
          <x>
using the angle bracket operator characters. If we considered ``being at the beginning of a line'' to be start condition ONE, then the ^ operator would be equivalent to
          <ONE>
Start conditions are explained more fully later.

Repetitions and Definitions. The operators {} specify either repetitions (if they enclose numbers) or definition expansion (if they enclose a name). For example

          {digit}
looks for a predefined string named digit and inserts it at that point in the expression. The definitions are given in the first part of the Lex input, before the rules. In contrast,
          a{1,5}
looks for 1 to 5 occurrences of a.

Finally, initial % is special, being the separator for Lex source segments.

4. Lex Actions.

When an expression written as above is matched, Lex executes the corresponding action. This section describes some features of Lex which aid in writing actions. Note that there is a default action, which consists of copying the input to the output. This is performed on all strings not otherwise matched. Thus the Lex user who wishes to absorb the entire input, without producing any output, must provide rules to match everything. When Lex is being used with Yacc, this is the normal situation. One may consider that actions are what is done instead of copying the input to the output; thus, in general, a rule which merely copies can be omitted. Also, a character combination which is omitted from the rules and which appears as input is likely to be printed on the output, thus calling attention to the gap in the rules.

One of the simplest things that can be done is to ignore the input. Specifying a C null statement, ; as an action causes this result. A frequent rule is

          [ \t\n]   ;
which causes the three spacing characters (blank, tab, and newline) to be ignored.

Another easy way to avoid writing actions is the action character |, which indicates that the action for this rule is the action for the next rule. The previous example could also have been written

          " "    |
          "\t"   |
          "\n"   ;
with the same result, although in different style. The quotes around \n and \t are not required.

In more complex actions, the user will often want to know the actual text that matched some expression like [a-z]+. Lex leaves this text in an external character array named yytext. Thus, to print the name found, a rule like

          [a-z]+   printf("%s", yytext);
will print the string in yytext. The C function printf accepts a format argument and data to be printed; in this case, the format is ``print string'' (% indicating data conversion, and s indicating string type), and the data are the characters in yytext. So this just places the matched string on the output. This action is so common that it may be written as ECHO:
          [a-z]+   ECHO;
is the same as the above. Since the default action is just to print the characters found, one might ask why give a rule, like this one, which merely specifies the default action? Such rules are often required to avoid matching some other rule which is not desired. For example, if there is a rule which matches read it will normally match the instances of read contained in bread or readjust; to avoid this, a rule of the form [a-z]+ is needed. This is explained further below.

Sometimes it is more convenient to know the end of what has been found; hence Lex also provides a count yyleng of the number of characters matched. To count both the number of words and the number of characters in words in the input, the user might write
          [a-zA-Z]+   {words++; chars += yyleng;}
which accumulates in chars the number of characters in the words recognized. The last character in the string matched can be accessed by

          yytext[yyleng-1]
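
The counting rule above can be mimicked by hand to show what the two counters end up holding. The following C sketch is an illustration, not Lex output: the names count_words and word_stats are hypothetical, the local variable len plays the role of yyleng, and isalpha is assumed to behave as in the default C locale, where it accepts exactly the letters of [a-zA-Z].

```c
#include <ctype.h>
#include <stddef.h>

struct word_stats {
    long words;   /* number of words seen */
    long chars;   /* total characters inside those words */
};

/* Walk the text, treating each maximal run of letters as one match
 * of [a-zA-Z]+ and every other character as unmatched input. */
struct word_stats count_words(const char *text)
{
    struct word_stats st = {0, 0};

    while (*text != '\0') {
        if (isalpha((unsigned char)*text)) {
            size_t len = 0;               /* corresponds to yyleng */

            while (isalpha((unsigned char)text[len]))
                len++;
            st.words++;
            st.chars += (long)len;
            text += len;                  /* continue after the match */
        } else {
            text++;                       /* not part of any word */
        }
    }
    return st;
}
```

For the text ``the quick  brown fox!'' this yields 4 words and 16 characters in words.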

Occasionally, a Lex action may decide that a rule has not recognized the correct span of characters. Two routines are provided to aid with this situation. First, yymore() can be called to indicate that the next input expression recognized is to be tacked on to the end of this input. Normally, the next input string would overwrite the current entry in yytext. Second, yyless(n) may be called to indicate that not all the characters matched by the currently successful expression are wanted right now. The argument n indicates the number of characters in yytext to be retained. Further characters previously matched are returned to the input. This provides the same sort of lookahead offered by the / operator, but in a different form.

Example: Consider a language which defines a string as a set of characters between quotation (") marks, and provides that to include a " in a string it must be preceded by a \. The regular expression which matches that is somewhat confusing, so that it might be preferable to write

          \"[^"]*   {
                    if (yytext[yyleng-1] == '\\')
                         yymore();
                    else
                         ... normal user processing
                    }
which will, when faced with a string such as "abc\"def" first match the five characters "abc\; then the call to yymore() will cause the next part of the string, "def, to be tacked on the end. Note that the final quote terminating the string should be picked up in the code labeled ``normal processing''.

The function yyless() might be used to reprocess text in various circumstances. Consider the C problem of distinguishing the ambiguity of ``=-a''. Suppose it is desired to treat this as ``=- a'' but print a message. A rule might be

          =-[a-zA-Z]   {
                       printf("Op (=-) ambiguous\n");
                       yyless(yyleng-1);
                       ... action for =- ...
                       }
which prints a message, returns the letter after the operator to the input stream, and treats the operator as ``=-''. Alternatively it might be desired to treat this as ``= -a''. To do this, just return the minus sign as well as the letter to the input:
          =-[a-zA-Z]   {
                       printf("Op (=-) ambiguous\n");
                       yyless(yyleng-2);
                       ... action for = ...
                       }
will perform the other interpretation. Note that the expressions for the two cases might more easily be written
          =-/[A-Za-z]
in the first case and
          =/-[A-Za-z]
in the second; no backup would be required in the rule action. It is not necessary to recognize the whole identifier to observe the ambiguity. The possibility of ``=-3'', however, makes
          =-/[^ \t\n]
a still better rule.

In addition to these routines, Lex also permits access to the I/O routines it uses. They are:

1) input() which returns the next input character;

2) output(c) which writes the character c on the output; and

3) unput(c) pushes the character c back onto the input stream to be read later by input().

By default these routines are provided as macro definitions, but the user can override them and supply private versions. These routines define the relationship between external files and internal characters, and must all be retained or modified consistently. They may be redefined, to cause input or output to be transmitted to or from strange places, including other programs or internal memory; but the character set used must be consistent in all routines; a value of zero returned by input must mean end of file; and the relationship between unput and input must be retained or the Lex lookahead will not work. Lex does not look ahead at all if it does not have to, but every rule ending in + * ? or $ or containing / implies lookahead. Lookahead is also necessary to match an expression that is a prefix of another expression. See below for a discussion of the character set used by Lex. The standard Lex library imposes a 100 character limit on backup.

Another Lex library routine that the user will sometimes want to redefine is yywrap() which is called whenever Lex reaches an end-of-file. If yywrap returns a 1, Lex continues with the normal wrapup on end of input. Sometimes, however, it is convenient to arrange for more input to arrive from a new source. In this case, the user should provide a yywrap which arranges for new input and returns 0. This instructs Lex to continue processing. The default yywrap always returns 1.

This routine is also a convenient place to print tables, summaries, etc. at the end of a program. Note that it is not possible to write a normal rule which recognizes end-of-file; the only access to this condition is through yywrap. In fact, unless a private version of input() is supplied a file containing nulls cannot be handled, since a value of 0 returned by input is taken to be end-of-file.

5. Ambiguous Source Rules.

Lex can handle ambiguous specifications. When more than one expression can match the current input, Lex chooses as follows:

1) The longest match is preferred.

2) Among rules which matched the same number of characters, the rule given first is preferred.

Thus, suppose the rules

          integer   keyword action ...;
          [a-z]+    identifier action ...;
to be given in that order. If the input is integers, it is taken as an identifier, because [a-z]+ matches 8 characters while integer matches only 7. If the input is integer, both rules match 7 characters, and the keyword rule is selected because it was given first. Anything shorter (e.g. int) will not match the expression integer and so the identifier interpretation is used.
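
The two preferences can be written out directly for this pair of rules. The C sketch below is a hand-coded illustration and not the mechanism Lex uses; the names next_token, token and token_kind are assumptions introduced for the example. It computes how many characters each rule would match and then applies the longest-match preference, falling back on rule order for ties.

```c
#include <stddef.h>
#include <string.h>

enum token_kind { TOK_NONE, TOK_KEYWORD, TOK_IDENTIFIER };

struct token {
    enum token_kind kind;   /* which rule was selected */
    size_t length;          /* how many characters it matched */
};

/* Decide between the rules "integer" (given first) and "[a-z]+"
 * (given second) at the start of input. */
struct token next_token(const char *input)
{
    static const char keyword[] = "integer";
    const size_t kwlen = sizeof keyword - 1;
    struct token t = {TOK_NONE, 0};

    /* Rule 1: the literal keyword matches 7 characters or none. */
    size_t len1 = (strncmp(input, keyword, kwlen) == 0) ? kwlen : 0;
    /* Rule 2: [a-z]+ matches the longest run of lower case letters. */
    size_t len2 = strspn(input, "abcdefghijklmnopqrstuvwxyz");

    if (len1 == 0 && len2 == 0)
        return t;                         /* neither rule applies */

    if (len1 >= len2) {                   /* tie goes to the earlier rule */
        t.kind = TOK_KEYWORD;
        t.length = len1;
    } else {                              /* strictly longer match wins */
        t.kind = TOK_IDENTIFIER;
        t.length = len2;
    }
    return t;
}
```

On integers the sketch selects the identifier rule with length 8, on integer the keyword rule with length 7, and on int the identifier rule with length 3.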

The principle of preferring the longest match makes rules containing expressions like .* dangerous. For example,
          '.*'
might seem a good way of recognizing a string in single quotes. But it is an invitation for the program to read far ahead, looking for a distant single quote. Presented with the input

          'first' quoted string here, 'second' here
the above expression will match
          'first' quoted string here, 'second'
which is probably not what was wanted. A better rule is of the form
          '[^'\n]*'
which, on the above input, will stop after 'first'. The consequences of errors like this are mitigated by the fact that the . operator will not match newline. Thus expressions like .* stop on the current line. Don't try to defeat this with expressions like (.|\n)+ or equivalents; the Lex generated program will try to read the entire input file, causing internal buffer overflows.
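
The difference between the two expressions can be checked with a pair of hand-written matchers. The C sketch below is illustrative only; the names match_greedy and match_bounded are hypothetical, and each function returns the number of characters its expression would match at the start of the string, or 0 if it does not match there.

```c
#include <stddef.h>
#include <string.h>

/* Length matched by '.*' : from the opening quote to the LAST quote
 * on the current line, since . never matches newline. */
size_t match_greedy(const char *s)
{
    if (s[0] != '\'')
        return 0;

    size_t line_end = strcspn(s, "\n");   /* index of newline or end */

    for (size_t i = line_end; i > 1; i--)
        if (s[i - 1] == '\'')
            return i;                     /* length up to and including it */
    return 0;                             /* no closing quote on this line */
}

/* Length matched by '[^'\n]*' : from the opening quote to the FIRST
 * following quote, provided no newline intervenes. */
size_t match_bounded(const char *s)
{
    if (s[0] != '\'')
        return 0;

    size_t body = strcspn(s + 1, "'\n");  /* characters between the quotes */

    return (s[1 + body] == '\'') ? body + 2 : 0;
}
```

On the sample input above, match_greedy covers everything through the quote that closes 'second', while match_bounded stops after the 7 characters of 'first'.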

Note that Lex is normally partitioning the input stream, not searching for all possible matches of each expression. This means that each character is accounted for once and only once. For example, suppose it is desired to count occurrences of both she and he in an input text. Some Lex rules to do this might be

          she   s++;
          he    h++;
          \n    |
          .     ;
where the last two rules ignore everything besides he and she. Remember that . does not include newline. Since she includes he, Lex will normally not recognize the instances of he included in she, since once it has passed a she those characters are gone.

Sometimes the user would like to override this choice. The action REJECT means ``go do the next alternative.'' It causes whatever rule was second choice after the current rule to be executed. The position of the input pointer is adjusted accordingly. Suppose the user really wants to count the included instances of he:

          she   {s++; REJECT;}
          he    {h++; REJECT;}
          \n    |
          .     ;
these rules are one way of changing the previous example to do just that. After counting each expression, it is rejected; whenever appropriate, the other expression will then be counted. In this example, of course, the user could note that she includes he but not vice versa, and omit the REJECT action on he; in other cases, however, it would not be possible a priori to tell which input characters were in both classes.
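
The two sets of rules for she and he give different counts, which a short C sketch can make explicit. This is a hand-written approximation of the observable behavior, not of how REJECT is implemented; the names counts, count_partition and count_with_reject are assumptions for the illustration.

```c
#include <string.h>

struct counts {
    int she;
    int he;
};

/* Without REJECT: the input is partitioned, so the characters of a
 * matched she are consumed and the he inside it is never seen. */
struct counts count_partition(const char *text)
{
    struct counts c = {0, 0};

    while (*text != '\0') {
        if (strncmp(text, "she", 3) == 0) {
            c.she++;
            text += 3;
        } else if (strncmp(text, "he", 2) == 0) {
            c.he++;
            text += 2;
        } else {
            text++;                       /* the . and \n rules: ignore */
        }
    }
    return c;
}

/* With REJECT on both rules: after counting, the match is given up
 * and only one character is consumed, so every occurrence is found. */
struct counts count_with_reject(const char *text)
{
    struct counts c = {0, 0};

    for (; *text != '\0'; text++) {
        if (strncmp(text, "she", 3) == 0)
            c.she++;
        if (strncmp(text, "he", 2) == 0)
            c.he++;
    }
    return c;
}
```

For the text ``she said he hears the shed'' the partitioning version counts 2 instances of she and 3 of he, while the REJECT version counts 2 of she and 5 of he.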

Consider the two rules

          a[bc]+   { ... ; REJECT;}
          a[cd]+   { ... ; REJECT;}
If the input is ab, only the first rule matches, and on ad only the second matches. The input string accb matches the first rule for four characters and then the second rule for three characters. In contrast, the input accd agrees with the second rule for four characters and then the first rule for three.

In general, REJECT is useful whenever the purpose of Lex is not to partition the input stream but to detect all examples of some items in the input, and the instances of these items may overlap or include each other. Suppose a digram table of the input is desired; normally the digrams overlap, that is the word the is considered to contain both th and he. Assuming a two-dimensional array named digram to be incremented, the appropriate source is

          %%
          [a-z][a-z]   {
                       digram[yytext[0]][yytext[1]]++;
                       REJECT;
                       }
          .            ;
          \n           ;
where the REJECT is necessary to pick up a letter pair beginning at every character, rather than at every other character.
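
The table these rules build can be reproduced by a direct loop that looks at the pair beginning at every character. The following C sketch is an illustration under stated assumptions: the function name count_digrams is hypothetical, and unlike the Lex source, which indexes digram by raw character codes, it uses a 26 by 26 table indexed by position in the alphabet.

```c
#include <stddef.h>
#include <string.h>

/* Count overlapping pairs of lower case letters. digram[i][j] is
 * incremented for the pair whose first letter is the i-th letter of
 * the alphabet and whose second is the j-th (both counted from 0).
 * The caller is expected to zero the table beforehand. */
void count_digrams(const char *text, unsigned digram[26][26])
{
    static const char letters[] = "abcdefghijklmnopqrstuvwxyz";

    /* Advance one character at a time, as REJECT makes Lex do. */
    for (; text[0] != '\0' && text[1] != '\0'; text++) {
        const char *p = strchr(letters, text[0]);
        const char *q = strchr(letters, text[1]);

        if (p != NULL && q != NULL)       /* both are lower case letters */
            digram[p - letters][q - letters]++;
    }
}
```

Advancing by two characters after each pair instead of one would give the behavior of the rules without REJECT, and the pair he in the word the would be missed.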

6. Lex Source Definitions.

Remember the format of the Lex source:
          {definitions}
          %%
          {rules}
          %%
          {user routines}
So far only the rules have been described. The user needs additional options, though, to define variables for use in his program and for use by Lex. These can go either in the definitions section or in the rules section.

Remember that Lex is turning the rules into a program. Any source not intercepted by Lex is copied into the generated program. There are three classes of such things.

1) Any line which is not part of a Lex rule or action which begins with a blank or tab is copied into the Lex generated program. Such source input prior to the first %% delimiter will be external to any function in the code; if it appears immediately after the first %%, it appears in an appropriate place for declarations in the function written by Lex which contains the actions. This material must look like program fragments, and should precede the first Lex rule. As a side effect of the above, lines which begin with a blank or tab, and which contain a comment, are passed through to the generated program. This can be used to include comments in either the Lex source or the generated code. The comments should follow the host language convention.

2) Anything included between lines containing only %{ and %} is copied out as above. The delimiters are discarded. This format permits entering text like preprocessor statements that must begin in column 1, or copying lines that do not look like programs.

3) Anything after the second %% delimiter, regardless of formats, etc., is copied out after the Lex output.

Definitions intended for Lex are given before the first %%delimiter. Any line in this section not contained between %{ and%}, and begining in column 1, is assumed to define Lexsubstitution strings. The format of such lines isname translationand it causes the string given as a translation to be associatedwith the name. The name and translation must be separated by atleast one blank or tab, and the name must begin with a letter.The translation can then be called out by the {name} syntax in arule. Using {D} for the digits and {E} for an exponent field, forexample, might abbreviate rules to recognize numbers:

                   D                   [0-9]
E [DEde][-+]?{D}+
%%
{D}+ printf("integer");
{D}+"."{D}*({E})? |
{D}*"."{D}+({E})? |
{D}+{E}
Note the first two rules for real numbers; both require a decimalpoint and contain an optional exponent field, but the firstrequires at least one digit before the decimal point and thesecond requires at least one digit after the decimal point. Tocorrectly handle the problem posed by a Fortran expression such as35.EQ.I, which does not contain a real number, a context-sensitiverule such as
                      [0-9]+/"."EQ   printf("integer");
could be used in addition to the normal rule for integers.
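
To make the four number rules concrete, the following is a minimal hand-written sketch in C of what they accept. It is not code produced by Lex, and the names classify_number, count_digits and exponent_length are assumptions made for this illustration. It classifies a complete string, whereas a Lex analyzer would look for the longest match at each input point.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum number_kind { NOT_A_NUMBER, INTEGER, REAL };

/* Count the leading digits of s: the {D}* or {D}+ part of a rule. */
static size_t count_digits(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Length of an exponent field {E} = [DEde][-+]?{D}+ at s, or 0 if none. */
static size_t exponent_length(const char *s)
{
    size_t n, digits;
    if (s[0] != 'D' && s[0] != 'E' && s[0] != 'd' && s[0] != 'e')
        return 0;
    n = 1;
    if (s[n] == '-' || s[n] == '+')
        n++;
    digits = count_digits(s + n);
    return digits > 0 ? n + digits : 0;
}

/* Classify a whole string according to the four rules (hypothetical helper). */
enum number_kind classify_number(const char *s)
{
    size_t before, after, exponent, pos;

    assert(s != NULL);
    before = count_digits(s);
    pos = before;
    if (s[pos] == '.') {
        after = count_digits(s + pos + 1);
        if (before == 0 && after == 0)
            return NOT_A_NUMBER;            /* a lone "." has no digit on either side */
        pos += 1 + after;
        pos += exponent_length(s + pos);    /* ({E})? is optional */
        return s[pos] == '\0' ? REAL : NOT_A_NUMBER;
    }
    if (before == 0)
        return NOT_A_NUMBER;
    if (s[pos] == '\0')
        return INTEGER;                     /* {D}+ */
    exponent = exponent_length(s + pos);
    if (exponent > 0 && s[pos + exponent] == '\0')
        return REAL;                        /* {D}+{E} */
    return NOT_A_NUMBER;
}
```

Under this sketch "35" is an integer, "12.", ".5" and "1e5" are reals, and the whole string "35.EQ.I" is not a number, which is the distinction the trailing-context rule is meant to preserve.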

The definitions section may also contain other commands, including the selection of a host language, a character set table, a list of start conditions, or adjustments to the default size of arrays within Lex itself for larger source programs. These possibilities are discussed below under ``Summary of Source Format,'' section 12.

7. Usage.

There are two steps in compiling a Lex source program. First, the Lex source must be turned into a generated program in the host general purpose language. Then this program must be compiled and loaded, usually with a library of Lex subroutines. The generated program is on a file named lex.yy.c. The I/O library is defined in terms of the C standard library [6].

The C programs generated by Lex are slightly different on OS/370, because the OS compiler is less powerful than the UNIX or GCOS compilers, and does less at compile time. C programs generated on GCOS and UNIX are the same.

UNIX. The library is accessed by the loader flag -ll. So an appropriate set of commands is
        lex source
        cc lex.yy.c -ll
The resulting program is placed on the usual file a.out for later execution. To use Lex with Yacc see below. Although the default Lex I/O routines use the C standard library, the Lex automata themselves do not do so; if private versions of input, output and unput are given, the library can be avoided.
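
As an illustration of what private I/O routines might look like, here is a minimal sketch in C that reads from a string in memory instead of a file. The names my_input, my_unput, my_output, set_source and collected_output are hypothetical, as are the buffer sizes; in a real Lex source the replacements would have to carry the names input, unput and output. The sketch assumes that a value of 0 from the input routine signals end of file.

```c
#include <assert.h>
#include <stddef.h>

#define PUSHBACK_MAX 16                 /* assumed limit on pushed-back characters */

static const char *source = "";         /* text still to be read */
static char pushback[PUSHBACK_MAX];     /* characters returned by my_unput */
static size_t pushed = 0;
static char sink[256];                  /* where written characters collect */
static size_t sink_len = 0;

/* Start reading from text and discard any earlier state. */
void set_source(const char *text)
{
    assert(text != NULL);
    source = text;
    pushed = 0;
    sink_len = 0;
    sink[0] = '\0';
}

/* Return the next character, or 0 at the end of the text. */
int my_input(void)
{
    if (pushed > 0)
        return (unsigned char)pushback[--pushed];
    if (*source == '\0')
        return 0;
    return (unsigned char)*source++;
}

/* Push c back so that the next my_input call returns it. */
void my_unput(int c)
{
    assert(pushed < PUSHBACK_MAX);
    pushback[pushed++] = (char)c;
}

/* Append c to the collected output; extra characters are dropped. */
void my_output(int c)
{
    if (sink_len + 1 < sizeof sink) {
        sink[sink_len++] = (char)c;
        sink[sink_len] = '\0';
    }
}

/* The characters written so far, as a string. */
const char *collected_output(void)
{
    return sink;
}
```

The three routines must agree with one another: characters pushed back by the unput routine are read again before any new input, most recent first.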

8. Lex and Yacc.

If you want to use Lex with Yacc, note that what Lex writes is a program named yylex(), the name required by Yacc for its analyzer. Normally, the default main program on the Lex library calls this routine, but if Yacc is loaded, and its main program is used, Yacc will call yylex(). In this case each Lex rule should end with
                               return(token);
where the appropriate token value is returned. An easy way to get access to Yacc's names for tokens is to compile the Lex output file as part of the Yacc output file by placing the line
        # include "lex.yy.c"
in the last section of Yacc input. Supposing the grammar to be named ``good'' and the lexical rules to be named ``better'' the UNIX command sequence can just be:
        yacc good
        lex better
        cc y.tab.c -ly -ll
The Yacc library (-ly) should be loaded before the Lex library, to obtain a main program which invokes the Yacc parser. The generations of Lex and Yacc programs can be done in either order.
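
The contract between the parser and the analyzer can be sketched with a hand-written stand-in for yylex in C. This is not what Lex generates; it only shows the calling convention a parser relies on: each call returns one token code, and zero marks the end of the input. The token code NUMBER (given the value 257 here), the helper scan_string, and the use of an int yylval are assumptions for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define NUMBER 257              /* hypothetical token code; Yacc would supply the real one */

int yylval;                     /* value passed to the parser along with the token */
static const char *scan_pos = "";

/* Choose the text to be scanned (hypothetical helper). */
void scan_string(const char *text)
{
    assert(text != NULL);
    scan_pos = text;
}

/* Return the next token code, or 0 when the input is exhausted. */
int yylex(void)
{
    while (*scan_pos == ' ' || *scan_pos == '\t')
        scan_pos++;                         /* skip blanks between tokens */
    if (*scan_pos == '\0')
        return 0;                           /* zero tells the parser the input is finished */
    if (isdigit((unsigned char)*scan_pos)) {
        yylval = 0;
        while (isdigit((unsigned char)*scan_pos))
            yylval = yylval * 10 + (*scan_pos++ - '0');
        return NUMBER;                      /* plays the role of return(token) in a rule */
    }
    return (unsigned char)*scan_pos++;      /* any other character stands for itself */
}
```

In a Lex source the same effect comes from ending each action with a return statement, so that yylex hands one token back to the parser per call.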

9. Examples.

As a trivial problem, consider copying an input file while adding 3 to every positive number divisible by 7. Here is a suitable Lex source program
        %%
                int k;
        [0-9]+  {
                k = atoi(yytext);
                if (k%7 == 0)
                        printf("%d", k+3);
                else
                        printf("%d",k);
                }
to do just that. The rule [0-9]+ recognizes strings of digits; atoi converts the digits to binary and stores the result in k. The operator % (remainder) is used to check whether k is divisible by 7; if it is, it is incremented by 3 as it is written out. It may be objected that this program will alter such input items as 49.63 or X7. Furthermore, it increments the absolute value of all negative numbers divisible by 7. To avoid this, just add a few more rules after the active one, as here:
        %%
                int k;
        -?[0-9]+        {
                        k = atoi(yytext);
                        printf("%d",
                                k%7 == 0 ? k+3 : k);
                        }
        -?[0-9.]+       ECHO;
        [A-Za-z][A-Za-z0-9]+    ECHO;
Numerical strings containing a ``.'' or preceded by a letter will be picked up by one of the last two rules, and not changed. The if-else has been replaced by a C conditional expression to save space; the form a?b:c means ``if a then b else c''.
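
The interplay of the three rules can be imitated by hand. The following C sketch is an approximation written for this illustration, not Lex output: at each input position it measures how much each rule would match, takes the longest match, and prefers the earlier rule on a tie. The names add_three, match_numeric and match_word are hypothetical, and the sketch assumes that every number fits in an int.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Length matched at s by -?[0-9]+ (allow_dot == 0) or -?[0-9.]+ (allow_dot != 0). */
static size_t match_numeric(const char *s, int allow_dot)
{
    size_t n = (s[0] == '-') ? 1 : 0;
    size_t start = n;
    while (isdigit((unsigned char)s[n]) || (allow_dot && s[n] == '.'))
        n++;
    return n > start ? n : 0;
}

/* Length matched at s by [A-Za-z][A-Za-z0-9]+, or 0 if the rule does not match. */
static size_t match_word(const char *s)
{
    size_t n = 1;
    if (!isalpha((unsigned char)s[0]))
        return 0;
    while (isalnum((unsigned char)s[n]))
        n++;
    return n >= 2 ? n : 0;
}

/* Copy in to out (capacity cap), adding 3 to each integer divisible by 7. */
void add_three(const char *in, char *out, size_t cap)
{
    size_t used = 0;

    assert(in != NULL && out != NULL && cap > 0);
    out[0] = '\0';
    while (*in != '\0') {
        size_t integer = match_numeric(in, 0);
        size_t dotted = match_numeric(in, 1);
        size_t word = match_word(in);
        char number[32];
        const char *text = in;              /* default action: echo the input */
        size_t len, out_len;

        if (integer > 0 && integer == dotted) {
            /* Both numeric rules match the same text; the earlier rule wins. */
            int k = atoi(in);
            len = integer;
            out_len = (size_t)snprintf(number, sizeof number, "%d",
                                       k % 7 == 0 ? k + 3 : k);
            text = number;
        } else {
            len = dotted > word ? dotted : word;
            if (len == 0)
                len = 1;                    /* no rule matches: copy one character */
            out_len = len;
        }
        if (used + out_len >= cap)
            break;                          /* out is full: stop with a truncated copy */
        memcpy(out + used, text, out_len);
        used += out_len;
        out[used] = '\0';
        in += len;
    }
}
```

Given the input "14 49.63 X7 -21 5" this sketch writes "17 49.63 X7 -18 5": 14 and -21 are changed, while 49.63 and X7 fall to the two ECHO rules.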

For an example of statistics gathering, here is a program which histograms the lengths of words, where a word is defined as a string of letters.

                int lengs[100];
        %%
        [a-z]+  lengs[yyleng]++;
        .       |
        \n      ;
        %%
        yywrap()
        {
                int i;
                printf("Length No. words\n");
                for(i=0; i<100; i++)
                        if (lengs[i] > 0)
                                printf("%5d%10d\n",i,lengs[i]);
                return(1);
        }
This program accumulates the histogram, while producing no output. At the end of the input it prints the table. The final statement return(1); indicates that Lex is to perform wrapup. If yywrap returns zero (false) it implies that further input is available and the program is to continue reading and processing. Providing a yywrap that never returns true causes an infinite loop.
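
The counting performed by the rule [a-z]+ can be sketched directly in C. The function name count_word_lengths and the constant MAX_LENGTH are assumptions for this illustration. Unlike the Lex program above, the sketch ignores any word of 100 letters or more rather than indexing past the end of the array, and it assumes a character set in which the lower case letters are contiguous.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LENGTH 100          /* same size as the lengs array in the Lex program */

/* Tally the lengths of maximal runs of lower case letters in text. */
void count_word_lengths(const char *text, int lengs[MAX_LENGTH])
{
    size_t run = 0;             /* letters seen in the current word */
    size_t i;

    assert(text != NULL && lengs != NULL);
    for (i = 0; i < MAX_LENGTH; i++)
        lengs[i] = 0;
    for (;; text++) {
        if (*text >= 'a' && *text <= 'z') {
            run++;
            continue;
        }
        /* A non-letter (or the end of the text) closes the current word. */
        if (run > 0 && run < MAX_LENGTH)
            lengs[run]++;
        run = 0;
        if (*text == '\0')
            break;
    }
}
```

Printing the nonzero entries of lengs afterwards corresponds to the work done in yywrap.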

As a larger example, here are some parts of a program written by N. L. Schryer to convert double precision Fortran to single precision Fortran. Because Fortran does not distinguish upper and lower case letters, this routine begins by defining a set of classes including both cases of each letter:

        a       [aA]
        b       [bB]
        c       [cC]
        ...
        z       [zZ]
An additional class recognizes white space:
        W       [ \t]*
The first rule changes ``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.
        {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
                printf(yytext[0]=='d'? "real" : "REAL");
                }
Care is taken throughout this program to preserve the case (upper or lower) of the original program. The conditional operator is used to select the proper form of the keyword. The next rule copies continuation card indications to avoid confusing them with constants:
                            ^"     "[^ 0]   ECHO;
In the regular expression, the quotes surround the blanks. It is interpreted as ``beginning of line, then five blanks, then anything but blank or zero.'' Note the two different meanings of ^. There follow some rules to change double precision constants to ordinary floating constants.
        [0-9]+{W}{d}{W}[+-]?{W}[0-9]+           |
        [0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9]+     |
        "."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9]+     {
                /* convert constants */
                for(p=yytext; *p != 0; p++)
                        {
                        if (*p == 'd' || *p == 'D')
                                *p += 'e' - 'd';
                        }
                /* write the modified constant once, after the scan */
                ECHO;
                }
After the floating point constant is recognized, it is scanned by the for loop to find the letter d or D. The program then adds 'e'-'d', which converts it to the next letter of the alphabet. The modified constant, now single-precision, is written out again. There follow a series of names which must be respelled to remove their initial d. By using the array yytext the same action suffices for all the names (only a sample of a rather long list is given here).
        {d}{s}{i}{n}            |
        {d}{c}{o}{s}            |
        {d}{s}{q}{r}{t}         |
        {d}{a}{t}{a}{n}         |
        ...
        {d}{f}{l}{o}{a}{t}      printf("%s",yytext+1);
Another list of names must have initial d changed to initial a:
        {d}{l}{o}{g}            |
        {d}{l}{o}{g}10          |
        {d}{m}{i}{n}1           |
        {d}{m}{a}{x}1           {
                yytext[0] += 'a' - 'd';
                ECHO;
                }
And one routine must have initial d changed to initial r:
        {d}1{m}{a}{c}{h}        {yytext[0] += 'r' - 'd';
                ECHO;
                }
To avoid such names as dsinx being detected as instances of dsin, some final rules pick up longer words as identifiers and copy some surviving characters:
        [A-Za-z][A-Za-z0-9]*    |
        [0-9]+                  |
        \n                      |
        .                       ECHO;
Note that this program is not complete; it does not deal with the spacing problems in Fortran or with the use of keywords as identifiers.
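
The character arithmetic used in these actions can be tried outside Lex. The C sketch below is an illustration only; the function names to_single_precision, respell_initial and without_initial are hypothetical. It relies, as the Lex program does, on a character set in which upper and lower case letters are laid out in the same order, so that adding 'e'-'d' turns d into e and D into E.

```c
#include <assert.h>
#include <stddef.h>

/* Turn each exponent letter d or D of a constant into e or E, in place. */
void to_single_precision(char *constant)
{
    char *p;

    assert(constant != NULL);
    for (p = constant; *p != '\0'; p++)
        if (*p == 'd' || *p == 'D')
            *p += 'e' - 'd';
}

/* Replace the initial d or D of a name by another letter, keeping its case.
 * lower_replacement is the new letter in lower case, such as 'a' or 'r'. */
void respell_initial(char *name, char lower_replacement)
{
    assert(name != NULL && (name[0] == 'd' || name[0] == 'D'));
    name[0] += lower_replacement - 'd';
}

/* Drop the initial letter, as printf("%s", yytext+1) does. */
const char *without_initial(const char *name)
{
    assert(name != NULL && name[0] != '\0');
    return name + 1;
}
```

For example, the constant 1.5d+3 becomes 1.5e+3, DLOG10 respelled with 'a' becomes ALOG10, and dsin without its initial letter is sin.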

10. Left Context Sensitivity.

Sometimes it is desirable to have several sets of lexical rules to be applied at different times in the input. For example, a compiler preprocessor might distinguish preprocessor statements and analyze them differently from ordinary statements. This requires sensitivity to prior context, and there are several ways of handling such problems. The ^ operator, for example, is a prior context operator, recognizing immediately preceding left context just as $ recognizes immediately following right context. Adjacent left context could be extended, to produce a facility similar to that for adjacent right context, but it is unlikely to be as useful, since often the relevant left context appeared some time earlier, such as at the beginning of a line.

This section describes three means of dealing with different environments: a simple use of flags, when only a few rules change from one environment to another, the use of start conditions on rules, and the possibility of making multiple lexical analyzers all run together. In each case, there are rules which recognize the need to change the environment in which the following input text is analyzed, and set some parameter to reflect the change. This may be a flag explicitly tested by the user's action code; such a flag is the simplest way of dealing with the problem, since Lex is not involved at all. It may be more convenient, however, to have Lex remember the flags as initial conditions on the rules. Any rule may be associated with a start condition. It will only be recognized when Lex is in that start condition. The current start condition may be changed at any time. Finally, if the sets of rules for the different environments are very dissimilar, clarity may be best achieved by writing several distinct lexical analyzers, and switching from one to another as desired.

Consider the following problem: copy the input to the output, changing the word magic to first on every line which began with the letter a, changing magic to second on every line which began with the letter b, and changing magic to third on every line which began with the letter c. All other words and all other lines are left unchanged.

These rules are so simple that the easiest way to do this job is with a flag:

                int flag;
        %%
        ^a      {flag = 'a'; ECHO;}
        ^b      {flag = 'b'; ECHO;}
        ^c      {flag = 'c'; ECHO;}
        \n      {flag =  0 ; ECHO;}
        magic   {
                switch (flag)
                {
                case 'a': printf("first"); break;
                case 'b': printf("second"); break;
                case 'c': printf("third"); break;
                default: ECHO; break;
                }
                }
should be adequate.
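
The behaviour of the flag-based rules can be imitated in plain C, which may help in checking what the Lex source is expected to do. This is a hand-written approximation, not Lex output; the function name replace_magic and its buffer-based interface are assumptions for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Copy in to out (capacity cap), rewriting "magic" according to the first
 * letter of the current line, as the flag-based rules do. */
void replace_magic(const char *in, char *out, size_t cap)
{
    int flag = 0;               /* 'a', 'b' or 'c'; 0 on any other line */
    int at_line_start = 1;
    size_t used = 0;

    assert(in != NULL && out != NULL && cap > 0);
    out[0] = '\0';
    while (*in != '\0') {
        const char *text = in;  /* what to write; echo the input by default */
        size_t len = 1;         /* input characters consumed */
        size_t out_len = 1;     /* characters written */

        if (strncmp(in, "magic", 5) == 0) {
            len = out_len = 5;
            switch (flag) {
            case 'a': text = "first";  break;
            case 'b': text = "second"; break;
            case 'c': text = "third";  break;
            default: break;     /* ECHO: leave the word unchanged */
            }
            if (text != in)
                out_len = strlen(text);
        } else if (at_line_start && (*in == 'a' || *in == 'b' || *in == 'c')) {
            flag = *in;         /* the ^a, ^b and ^c rules */
        } else if (*in == '\n') {
            flag = 0;           /* a new line starts with no flag set */
        }
        if (used + out_len >= cap)
            break;              /* out is full: stop with a truncated copy */
        memcpy(out + used, text, out_len);
        used += out_len;
        out[used] = '\0';
        at_line_start = (in[len - 1] == '\n');
        in += len;
    }
}
```

The flag is set only by a letter in the first column and cleared by every newline, so a line beginning with any other character leaves magic untouched.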

To handle the same problem with start conditions, each start condition must be introduced to Lex in the definitions section with a line reading

        %Start   name1 name2 ...
where the conditions may be named in any order. The word Start may be abbreviated to s or S. The conditions may be referenced at the head of a rule with the <> brackets:
        <name1>expression
is a rule which is only recognized when Lex is in the start condition name1. To enter a start condition, execute the action statement
        BEGIN name1;
which changes the start condition to name1. To resume the normal state,
        BEGIN 0;
resets the initial condition of the Lex automaton interpreter. A rule may be active in several start conditions:
        <name1,name2,name3>
is a legal prefix. Any rule not beginning with the <> prefix operator is always active.

The same example as before can be written:

        %START AA BB CC
        %%
        ^a              {ECHO; BEGIN AA;}
        ^b              {ECHO; BEGIN BB;}
        ^c              {ECHO; BEGIN CC;}
        \n              {ECHO; BEGIN 0;}
        <AA>magic       printf("first");
        <BB>magic       printf("second");
        <CC>magic       printf("third");
where the logic is exactly the same as in the previous method of handling the problem, but Lex does the work rather than the user's code.
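
One way to picture start conditions is as a single state variable that selects which rules are active. The C sketch below models only that bookkeeping; it is not how Lex implements start conditions internally, and the names INITIAL, begin_after and magic_in are assumptions for this illustration (the source above writes the normal state as BEGIN 0).

```c
#include <assert.h>
#include <stddef.h>

enum start_condition { INITIAL, AA, BB, CC, CONDITION_COUNT };

/* What the rule for magic prints in each condition. NULL means that no
 * prefixed rule is active there, so the text is copied unchanged. */
static const char *const magic_output[CONDITION_COUNT] = {
    NULL, "first", "second", "third"
};

/* The effect of the BEGIN statements: the condition in force after reading c. */
enum start_condition begin_after(enum start_condition current, int c,
                                 int at_line_start)
{
    if (c == '\n')
        return INITIAL;             /* BEGIN 0 */
    if (at_line_start) {
        switch (c) {
        case 'a': return AA;        /* BEGIN AA */
        case 'b': return BB;        /* BEGIN BB */
        case 'c': return CC;        /* BEGIN CC */
        default: break;
        }
    }
    return current;                 /* no rule changes the condition */
}

/* The replacement for magic in the given condition, or NULL for none. */
const char *magic_in(enum start_condition current)
{
    assert((unsigned)current < CONDITION_COUNT);
    return magic_output[current];
}
```

Compared with the flag version, the switch statement in the action disappears: choosing among the three magic rules is done by the current condition alone.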

11. Character Set.

The programs generated by Lex handle character I/O only through the routines input, output, and unput. Thus the character representation provided in these routines is accepted by Lex and employed to return values in yytext. For internal use a character is represented as a small integer which, if the standard library is used, has a value equal to the integer value of the bit pattern representing the character on the host computer. Normally, the letter a is represented in the same form as the character constant 'a'. If this interpretation is changed, by providing I/O routines which translate the characters, Lex must be told about it, by giving a translation table. This table must be in the definitions section, and must be bracketed by lines containing only ``%T''. The table contains lines of the form
                        {integer} {character string}
which indicate the value associated with each character. Thus the next example
        %T
         1      Aa
         2      Bb
        ...
        26      Zz
        27      \n
        28      +
        29      -
        30      0
        31      1
        ...
        39      9
        %T

Sample character table.
maps the lower and upper case letters together into the integers 1 through 26, newline into 27, + and - into 28 and 29, and the digits into 30 through 39. Note the escape for newline. If a table is supplied, every character that is to appear either in the rules or in any valid input must be included in the table. No character may be assigned the number 0, and no character may be assigned a bigger number than the size of the hardware character set.
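
The mapping described by the sample table can be written out as a lookup array, which is roughly the job a translating input routine would have to perform. The C sketch below is an illustration under stated assumptions: the name build_sample_table and the 256-entry array (8-bit characters) are not from Lex, and the value 0 is used here to mark characters absent from the table, consistent with the rule that no character may be assigned 0.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 256          /* assumes 8-bit characters */

/* Fill table so that table[c] is the integer the sample %T table assigns
 * to character c, or 0 if c does not appear in the table. */
void build_sample_table(unsigned char table[TABLE_SIZE])
{
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static const char digits[] = "0123456789";
    size_t i;

    assert(table != NULL);
    for (i = 0; i < TABLE_SIZE; i++)
        table[i] = 0;
    for (i = 0; i < 26; i++) {
        /* Both cases of a letter share one value, 1 through 26. */
        table[(unsigned char)lower[i]] = (unsigned char)(i + 1);
        table[(unsigned char)upper[i]] = (unsigned char)(i + 1);
    }
    table[(unsigned char)'\n'] = 27;
    table[(unsigned char)'+'] = 28;
    table[(unsigned char)'-'] = 29;
    for (i = 0; i < 10; i++)
        table[(unsigned char)digits[i]] = (unsigned char)(30 + i);
}
```

Because upper and lower case letters share a value, an analyzer built with such a table cannot tell them apart, which is the point of the example.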

12. Summary of Source Format.

The general form of a Lex source file is:
        {definitions}
        %%
        {rules}
        %%
        {user subroutines}
The definitions section contains a combination of

1) Definitions, in the form ``name space translation''.

2) Included code, in the form ``space code''.

3) Included code, in the form

        %{
        code
        %}

4) Start conditions, given in the form

        %S name1 name2 ...

5) Character set tables, in the form

        %T
        number space character-string
        ...
        %T

6) Changes to internal array sizes, in the form

        %x  nnn

where nnn is a decimal integer representing an array size and x selects the parameter as follows:

        Letter  Parameter
        p       positions
        n       states
        e       tree nodes
        a       transitions
        k       packed character classes
        o       output array size

Lines in the rules section have the form ``expression action'' where the action may be continued on succeeding lines by using braces to delimit it.

Regular expressions in Lex use the following operators:

        x        the character "x"
        "x"      an "x", even if x is an operator.
        \x       an "x", even if x is an operator.
        [xy]     the character x or y.
        [x-z]    the characters x, y or z.
        [^x]     any character but x.
        .        any character but newline.
        ^x       an x at the beginning of a line.
        <y>x     an x when Lex is in start condition y.
        x$       an x at the end of a line.
        x?       an optional x.
        x*       0,1,2, ... instances of x.
        x+       1,2,3, ... instances of x.
        x|y      an x or a y.
        (x)      an x.
        x/y      an x but only if followed by y.
        {xx}     the translation of xx from the
                 definitions section.
        x{m,n}   m through n occurrences of x

13. Caveats and Bugs.

There are pathological expressions which produce exponential growth of the tables when converted to deterministic machines; fortunately, they are rare.

REJECT does not rescan the input; instead it remembers the results of the previous scan. This means that if a rule with trailing context is found, and REJECT executed, the user must not have used unput to change the characters forthcoming from the input stream. This is the only restriction on the user's ability to manipulate the not-yet-processed input.

14. Acknowledgments.

As should be obvious from the above, the outside of Lex is patterned on Yacc and the inside on Aho's string matching routines. Therefore, both S. C. Johnson and A. V. Aho are really originators of much of Lex, as well as debuggers of it. Many thanks are due to both.

The code of the current version of Lex was designed, written, and debugged by Eric Schmidt.

15. References.

1. B. W. Kernighan and D. M. Ritchie, The C Programming Language, Prentice-Hall, N. J. (1978).

2. B. W. Kernighan, Ratfor: A Preprocessor for a Rational Fortran, Software Practice and Experience, 5, pp. 395-496 (1975).

3. S. C. Johnson, Yacc: Yet Another Compiler Compiler, Computing Science Technical Report No. 32, 1975, Bell Laboratories, Murray Hill, NJ 07974.

4. A. V. Aho and M. J. Corasick, Efficient String Matching: An Aid to Bibliographic Search, Comm. ACM 18, 333-340 (1975).

5. B. W. Kernighan, D. M. Ritchie and K. L. Thompson, QED Text Editor, Computing Science Technical Report No. 5, 1972, Bell Laboratories, Murray Hill, NJ 07974.

6. D. M. Ritchie, private communication. See also M. E. Lesk, The Portable C Library, Computing Science Technical Report No. 31, Bell Laboratories, Murray Hill, NJ 07974.