flex(1)
NAME
flex - Generates a C language lexical analyzer
SYNOPSIS
flex [-bcdfinpstvFILT8] [-C[efmF]] [-Sskeleton] [file...]
OPTIONS
-b      Generates backtracking information to lex.backtrack. This is a
        list of scanner states that require backtracking and the input
        characters on which they do so. By adding rules you can remove
        backtracking states. If all backtracking states are eliminated
        and -f or -F is used, the generated scanner will run faster.
-d      Makes the generated scanner run in debug mode. Whenever a
        pattern is recognized and the global yy_flex_debug is nonzero
        (which is the default), the scanner writes to stderr a line of
        the form:
            --accepting rule at line 53 ("the matched text")
        The line number refers to the location of the rule in the file
        defining the scanner (the input to flex). Messages are also
        generated when the scanner backtracks, accepts the default
        rule, reaches the end of its input buffer (or encounters a
        NUL), or reaches an End-of-File.
-f      Specifies full table (no table compression is done). The result
        is large but fast. This option is equivalent to -Cf.
-i      Instructs flex to generate a case-insensitive scanner. The case
        of letters given in the flex input patterns will be ignored,
        and tokens in the input will be matched regardless of case. The
        matched text given in yytext will have the original case (as
        read by the scanner).
-p      Generates a performance report to stderr. This identifies
        features of the flex input file that will cause a loss of
        performance in the resulting scanner.
-s      Causes the default rule (that unmatched scanner input is echoed
        to stdout) to be suppressed. If the scanner encounters input
        that does not match any of its rules, it aborts with an error.
-t      Instructs flex to write the scanner it generates to standard
        output instead of lex.yy.c.
-v      Specifies that flex should write to stderr a summary of
        statistics regarding the scanner it generates.
-F      Specifies that the fast scanner table representation should be
        used. This representation is about as fast as the full table
        representation (-f), and for some sets of patterns will be
        considerably smaller (and for others, larger). This option is
        equivalent to -CF.
-I      Instructs flex to generate an interactive scanner; that is, a
        scanner that stops immediately rather than looking ahead if it
        knows that the currently scanned text cannot be part of a
        longer rule's match. Note that -I cannot be used in conjunction
        with full or fast tables; that is, the -f, -F, -Cf, or -CF
        options.
-L      Instructs flex not to generate #line directives in lex.yy.c.
        The default is to generate such directives so error messages in
        the actions will be correctly located with respect to the
        original flex input file.
-T      Makes flex run in trace mode. It will generate a lot of
        messages to stdout concerning the form of the input and the
        resultant nondeterministic and deterministic finite automata.
        This option is mostly for use in maintaining flex.
-8      Instructs flex to generate an 8-bit scanner (which is the
        default).
-C[efmF]
        Controls the degree of table compression. The default setting
        is -Cem, which provides the highest degree of table
        compression. Faster-executing scanners can be traded off at the
        cost of larger tables, with the following generally being true:
            Slowest and smallest
                -Cem
                -Cm
                -Ce
                -C
                -C{f,F}e
                -C{f,F}
            Fastest and largest
        The -C options are not cumulative; whenever the option is
        encountered, the previous -C settings are forgotten. The -f or
        -F and -Cm options do not make sense together; there is no
        opportunity for meta-equivalence classes if the table is not
        being compressed. Otherwise, the options may be freely mixed.
        A lone -C specifies that the scanner tables should be
        compressed and neither equivalence classes nor meta-equivalence
        classes should be used.
        -Ce     Directs flex to construct equivalence classes; that is,
                sets of characters that have identical lexical
                properties. Equivalence classes usually give dramatic
                reductions in the final table/object file sizes
                (typically a factor of 2 to 5) and are inexpensive
                performance-wise (one array look-up per character
                scanned).
        -Cm     Directs flex to construct meta-equivalence classes,
                which are sets of equivalence classes (or characters,
                if equivalence classes are not being used) that are
                commonly used together. Meta-equivalence classes are
                often a big win when using compressed tables, but they
                have a moderate performance impact (one or two "if"
                tests and one array look-up per character scanned).
        -Cf     Specifies that the full scanner tables should be
                generated; flex should not compress the tables by
                taking advantage of similar transition functions for
                different states.
        -CF     Specifies that the alternative fast scanner
                representation should be used.
-Sskeleton
        Overrides the default skeleton file from which flex constructs
        its scanners. This is useful for flex maintenance or
        development.
-c      Specifies table-compression options. (Obsolescent)
-n      Suppresses the statistics summaries that the -v option
        typically generates. (Obsolete)
DESCRIPTION
The flex command is a tool for generating scanners: programs which
recognize lexical patterns in text. The flex command reads the given
input files, or its standard input if no filenames are given or if a
file operand is - (dash), for a description of a scanner to generate.
The description is in the form of pairs of regular expressions and C
code, called rules. The flex command generates as output a C source
file, lex.yy.c, which defines a routine yylex(). This file is compiled
and linked with the -ll library to produce an executable. When the
executable is run, it scans its input for occurrences of the regular
expressions in its rules, looking for the best match (longest input).
When it has selected a rule, it executes the associated C code, which
has access to the matched input sequence (commonly referred to as a
token). This process then repeats until input is exhausted.
The flex command treats multiple input files as one.
Syntax for Input
This section contains a description of the flex input file, which is
normally named with a .l suffix. The section also lists the special
values, macros, and functions recognized by flex.
The flex input file consists of three sections, separated by a line
with just %% in it:
        [ definitions ]
        %%
        [ rules ]
        [ %%
        [ user functions ]]
definitions
        Contains declarations to simplify the scanner specification,
        and declarations of start states, which are explained below.
rules   Describes what the scanner is to do.
user functions
        Contains user-supplied functions that are copied straight
        through to lex.yy.c.
With the exception of the first %% sequence, all sections are optional.
The minimal scanner, %%, copies its input to standard output.
Each line in the definitions section can be:
name regexp
        Defines name to expand to regexp. name is a word beginning with
        a letter or an underscore (_) followed by zero or more letters,
        digits, underscores, or dashes (-). In the regular-expression
        parts of the rules section, flex substitutes regexp wherever
        you refer to {name} (name within braces).
%s state [state...]
%x state [state...]
        Defines names for states used in the rules section. A rule may
        be made conditionally active based on the current scanner
        state. Multiple lines defining states can appear, and each can
        contain multiple state names, separated by white space. The
        name of a state follows the same syntax as that of regexp names
        except that dashes (-) are not permitted. Unlike regexp names,
        state names share the C #define namespace. In the rules
        section, states are recognized as <state> (state within angle
        brackets).
        The %x directive names exclusive states. When a scanner is in
        an exclusive state, only rules prefixed with that state are
        active. Inclusive states are named with the %s directive.
%{
%}      When placed on lines by themselves, these symbols enclose C
        code to be passed verbatim into the global definitions of the
        output file. Such lines commonly include preprocessor
        directives and declarations of external variables and
        functions.
Lines beginning with a space or tab in the definitions section are
passed directly into the lex.yy.c output file, as part of the initial
global definitions.
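The difference between inclusive (%s) and exclusive (%x) states can be
modeled in plain C. The sketch below is not flex code and is not how
flex implements states internally; struct rule, rule_is_active(), and
the state numbering are hypothetical names introduced for illustration.
It assumes the usual flex behavior that a rule with no state prefix is
active in the initial state and in every inclusive state, but not in an
exclusive state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a rule: the states listed in its <...> prefix. */
struct rule {
    const int *states;   /* state numbers from the prefix, or NULL if none */
    size_t     nstates;  /* number of entries in states */
};

/*
 * Returns 1 if the rule may match while the scanner is in current_state,
 * 0 otherwise. state_is_exclusive is nonzero when current_state was
 * declared with %x, and zero when it was declared with %s or is the
 * initial state (0).
 */
int rule_is_active(const struct rule *r, int current_state,
                   int state_is_exclusive)
{
    size_t i;

    assert(r != NULL);

    /* A rule with a state prefix is active only in the listed states. */
    for (i = 0; i < r->nstates; i++) {
        if (r->states[i] == current_state)
            return 1;
    }
    if (r->nstates > 0)
        return 0;

    /* A rule without a prefix is active except in exclusive states. */
    return !state_is_exclusive;
}
```

With a rule prefixed by state 1 and a rule with no prefix, only the
prefixed rule is active when state 1 is exclusive; both are active when
state 1 is inclusive.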
The rules section follows the definitions, separated by a line
consisting of %%. The rules section contains rules for matching input
and taking actions, in the following format:
        pattern [action]
The pattern starts in the first column of the line and extends until
the first non-escaped white space character. The flex command attempts
to find the pattern that matches the longest input sequence and
executes the associated action. If two or more patterns match the same
amount of input, the one which appears first in the rules section is
chosen. If no action exists, the matched input is discarded. If no
pattern matches the input, the default is to copy it to standard
output.
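The selection policy in the preceding paragraph (longest match wins,
and the earliest rule wins a tie) can be sketched in C. This is only a
model of the policy, not the table-driven matching that a generated
scanner performs; select_rule() and its match_len array are
hypothetical, and a zero-length match is treated as no match for
simplicity.

```c
#include <assert.h>
#include <stddef.h>

/*
 * match_len[i] is the number of input characters that rule i can match
 * at the current position (0 means the rule does not match). Rules are
 * indexed in the order they appear in the rules section.
 *
 * Returns the index of the rule whose action runs, or -1 if no rule
 * matches, in which case the default (copy the input to standard
 * output) applies.
 */
int select_rule(const size_t *match_len, size_t nrules)
{
    size_t i;
    size_t best_len = 0;
    int best = -1;

    assert(match_len != NULL || nrules == 0);

    for (i = 0; i < nrules; i++) {
        /* Strictly greater: on a tie the earlier rule keeps its place. */
        if (match_len[i] > best_len) {
            best_len = match_len[i];
            best = (int)i;
        }
    }
    return best;
}
```

For match lengths of 2, 5, and 5, the function picks the second rule:
it ties with the third on length but appears first.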
All action code is placed in the yylex() function. Text (C code or dec‐
larations) placed at the beginning of the rules section is copied to
the beginning of the yylex() function and may be used in actions. This
text must begin with a space or a tab (to distinguish it from rules).
In addition, any input (beginning with a space or within %{ and %}
delimiter lines) appearing at the beginning of the rules section before
any rules are specified will be written to lex.yy.c after the declara‐
tions of variables for the yylex() function and before the first line
of code in yylex().
Elements of each rule are:
<state[,state...]>
        A pattern may begin with a comma-separated list of state names
        enclosed by angle brackets. These states are entered via the
        BEGIN statement. If a pattern begins with a state, the scanner
        can only recognize it when in that state. The initial state is
        0 (zero).
pattern A regular expression to match against the input stream. The
        regular expressions in flex provide a rich character matching
        syntax.
        The following characters, shown in order of decreasing
        precedence, have special meanings:
        x       Matches the character x.
        " "     Enclose characters and treat them as literal strings.
                For example, "*+" is treated as the asterisk character
                followed by the plus character.
        \str    If str is one of the characters a, b, f, n, r, t, or v,
                then the ANSI C interpretation is adopted (for example,
                \n is a newline). If str is a string of octal digits,
                it is interpreted as a character with octal value str.
                If str is a string of hexadecimal digits with a leading
                x, it is interpreted as a character with that value.
                Otherwise, it is interpreted literally with no special
                meaning. For example, x\*yz represents the four
                characters x*yz.
        [ ]     Represents a character class in the enclosed range
                ([.-.]) or the enclosed list ([...]). The dash
                character is used to define a range of characters from
                the ASCII value or the 8-bit class of the character
                that comes before it to the ASCII value or the 8-bit
                class of the character that follows it. For example,
                [abcx-z] matches a, b, c, x, y, or z.
                The circumflex, when it appears as the first character
                in a character class, indicates the complement of the
                set of characters within that class. For example,
                [^abc] matches any character except a, b, or c,
                including special characters like newline.
        ( )     Groups regular expressions. For example, (ab) will be
                considered as a single regular expression.
        { }     When enclosing numbers, indicates a number of
                consecutive occurrences of the expression that comes
                before it. For example, (ab){1,5} indicates a match for
                from 1 to 5 occurrences of the string ab.
                When enclosing a name, the name represents a regular
                expression defined in the definitions section. For
                example, {digit} is replaced by the defined regular
                expression for digit. Note that the expansion takes
                place as if the definition were enclosed in
                parentheses.
        .       Matches any single character except newline.
        ?       Matches zero or one of the preceding expressions. For
                example, ab?c matches both ac and abc.
        *       Matches zero or more of the preceding expressions. For
                example, a* is zero or more consecutive a characters.
                The utility of matching zero occurrences is more
                obvious in complicated expressions. For example, the
                expression [A-Za-z][A-Za-z0-9]* indicates all
                alphanumeric strings with a leading alphabetic
                character, including strings that are only one
                alphabetic character.
        +       Matches one or more of the preceding expressions. For
                example, [a-z]+ is all strings of lowercase letters.
        xy      Matches the expression x followed by the expression y.
        |       Matches either the preceding expression or the
                following expression. For example, ab|cd matches
                either ab or cd.
        x/y     Matches expression x only if expression y (trailing
                context) immediately follows it. For example, ab/cd
                matches the string ab but only if followed by cd. Only
                one trailing context is permitted per pattern.
        ^       When it appears at the beginning of the pattern,
                matches the beginning of a line. For example, ^abc
                will match the string abc if it is found at the
                beginning of a line.
        $       When it appears at the end of a pattern, matches the
                end of a line. It is equivalent to /\n. For example,
                abc$ will match the string abc if it is found at the
                end of a line.
        <<EOF>> Matches an End-of-File.
        < >     Identifies a state name (see above) and may only
                appear at the beginning of a pattern. For example,
                <done><<EOF>> matches an End-of-File, but only if it
                is in state done.
        In addition, the following rules apply for bracket expressions:
        Equivalence class expressions
                These represent the set of collating elements in an
                equivalence class and are enclosed within bracket-equal
                delimiters ([= =]). An equivalence class generally is
                designed to deal with primary-secondary sorting; that
                is, for languages like French that define groups of
                characters as sorting to the same primary location, and
                then have a tie-breaking, secondary sort. For example,
                if a, à, and â belong to the same equivalence class,
                then [[=a=]b], [[=à=]b], and [[=â=]b] are each
                equivalent to [aàâb].
        Character class expressions
                These represent the set of characters in the current
                locale belonging to the named ctype class. These are
                expressed as a ctype class name enclosed in
                bracket-colon delimiters ([: :]).
                In the C or POSIX locale, this operating system
                supports the following character class expressions:
                [:alpha:], [:upper:], [:lower:], [:digit:], [:alnum:],
                [:xdigit:], [:space:], [:print:], [:punct:], [:graph:],
                [:cntrl:].
                Other locales may define additional character classes.
        Letters and digits never have special meanings. A character
        such as ^ or -, which has a special meaning in particular
        contexts, refers simply to itself when found outside that
        context. Spaces and tabs must be escaped to appear in a regular
        expression; otherwise they indicate the end of the expression.
action  Each pattern in a rule has a corresponding action, which can be
        any arbitrary C statement. The pattern ends at the first
        non-escaped white space character; the remainder of the line is
        its action. If the action is empty, then when the pattern is
        matched the input which matched it is discarded.
        If the action contains a {, then the action spans until the
        balancing } is found, and the action may cross multiple lines.
        Using a return statement in an action returns from yylex().
        An action consisting solely of a vertical bar (|) means the
        same as the action for the next rule.
        The flex variables which can be used within actions are:
        yytext  A string (char *) containing the current matched input.
                It cannot be modified.
        yyleng  The length (int) of the current matched input. It
                cannot be modified.
        yyin    A stream (FILE *) that flex reads from (stdin by
                default). It may be changed, but because of the
                buffering flex uses, this makes sense only before
                scanning begins. Once scanning terminates because an
                End-of-File was seen, void yyrestart(FILE *new_file)
                may be called to point yyin at a new input file.
                Alternatively, yyin may be changed whenever a new or
                different buffer is selected (see
                yy_switch_to_buffer()).
        yyout   A stream (FILE *) to which ECHO output is written
                (stdout by default). It can be changed by the user.
        YY_CURRENT_BUFFER
                Returns the current buffer (YY_BUFFER_STATE) used for
                scanner input.
        The flex command macros and functions that may be used within
        actions are:
        ECHO    Copies yytext to the scanner's output.
        BEGIN state
                Changes the scanner state to be state. This affects
                which rules are active. The state must be defined in a
                %s or %x definition. The initial state of the scanner
                is INITIAL or 0 (zero).
        REJECT  Directs the scanner to proceed immediately to the next
                best pattern that matches the input (which may be a
                prefix of the current match). yytext and yyleng are
                reset appropriately. Note that REJECT is a particularly
                expensive feature in terms of scanner performance; if
                it is used in any of the scanner's actions, it will
                slow down all of the scanner's pattern matching
                operations. REJECT cannot be used if flex is invoked
                with either the -f or -F option.
        yymore()
                Indicates that the next matched text should be appended
                to the currently matched text in yytext (rather than
                replace it).
        yyless(n)
                Returns all but the first n characters of the current
                token back to the input stream, where they will be
                rescanned when the scanner looks for the next match.
                yytext and yyleng are adjusted accordingly.
        yywrap()
                Returns 0 (zero) if there is more input to scan or 1 if
                there is not. The default yywrap() always returns 1.
                Currently it is implemented as a macro; however, in
                future implementations it may become a function.
        yyterminate()
                Can be used in lieu of a return statement in an action.
                It terminates the scanner and returns a 0 (zero) to the
                scanner's caller. yyterminate() is automatically called
                when an End-of-File is encountered. It is a macro and
                may be redefined.
        yy_create_buffer(file, size)
                Returns a YY_BUFFER_STATE handle to a new input buffer
                large enough to accommodate size characters and
                associated with the given file. When in doubt, use
                YY_BUF_SIZE for the size.
        yy_switch_to_buffer(buffer)
                Switches the scanner's processing to scan for tokens
                from the given buffer, which must be a YY_BUFFER_STATE.
        yy_delete_buffer(buffer)
                Deletes the given buffer.
        YY_NEW_FILE
                Enables scanning to continue after yyin has been
                pointed at a new file to process.
        YY_DECL Controls how the scanning function, yylex(), is
                declared. By default, it is int yylex(), or, if
                prototypes are being used, int yylex(void). This
                definition may be changed by redefining the YY_DECL
                macro. This macro is expanded immediately before the
                {...} (braces) that delimit the scanner function body.
        YY_INPUT
                Controls scanner input. By default, YY_INPUT reads from
                the file pointer yyin. Its action is to place up to
                max_size characters in the character array buf and
                return in the integer variable result either the number
                of characters read or the constant YY_NULL to indicate
                EOF. Following is a sample redefinition of YY_INPUT, in
                the definitions section of the input file:
                    %{
                    #undef YY_INPUT
                    #define YY_INPUT(buf,result,max_size) \
                        { \
                        int c = getchar(); \
                        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
                        }
                    %}
                When the scanner receives an End-of-File indication
                from YY_INPUT, it checks the yywrap() function. If
                yywrap() returns zero, it is assumed that yyin has been
                set up to point to another input file, and scanning
                continues. If it returns nonzero, then the scanner
                terminates, returning zero to its caller.
        YY_USER_ACTION
                Redefinable to provide an action which is always
                executed prior to the matched pattern's action.
        YY_USER_INIT
                Redefinable to provide an action which is always
                executed before the first scan.
        YY_BREAK
                Is used in the scanner to separate different actions.
                By default, it is simply a break, but may be redefined
                if necessary.
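The YY_INPUT contract described above (fill buf with at most max_size
characters and report either the count or YY_NULL) can be illustrated
with a reader that takes its input from memory instead of yyin. This is
a sketch under stated assumptions, not part of flex: struct
string_source, string_input(), and MODEL_YY_NULL are hypothetical
names, and MODEL_YY_NULL stands in for YY_NULL, which is assumed to be
0 here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_YY_NULL 0   /* stand-in for YY_NULL; assumed to be 0 */

/* Hypothetical in-memory input source. */
struct string_source {
    const char *data;  /* text to scan */
    size_t      len;   /* total length of data */
    size_t      pos;   /* offset of the next unread character */
};

/*
 * Copies up to max_size characters into buf and returns the number
 * copied, or MODEL_YY_NULL once the source is exhausted.
 */
int string_input(struct string_source *src, char *buf, int max_size)
{
    size_t remaining;
    size_t n;

    assert(src != NULL && buf != NULL && max_size > 0);

    remaining = src->len - src->pos;
    if (remaining == 0)
        return MODEL_YY_NULL;

    n = remaining < (size_t)max_size ? remaining : (size_t)max_size;
    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    return (int)n;
}
```

A redefinition of YY_INPUT in the definitions section could then
delegate to such a function and assign its return value to result.
Reading more than one character per call, as this sketch does, avoids
the per-character overhead of the getchar() version.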
The user functions section consists of complete C functions, which are
passed directly into the lex.yy.c output file (the effect is similar to
defining the functions in separate files and linking them with
lex.yy.c). This section is separated from the rules section by the %%
delimiter.
Comments, in C syntax, can appear anywhere in the user functions or
definitions sections. In the rules section, comments can be embedded
within actions. Empty lines or lines consisting of white space are
ignored.
The following macros are not normally called explicitly within an
action, but are used internally by flex to handle the input and output
streams.
input() Reads the next character from the input stream. You cannot
        redefine input().
output()
        Writes the next character to the output stream.
unput(c)
        Puts the character c back onto the input stream. It will be the
        next character scanned. You cannot redefine unput().
The libl.a library contains default functions to support testing or
quick use of a flex program without yacc; these functions can be linked
in through -ll. They can also be provided by the user.
main()  A simple wrapper that calls setlocale() and then calls the
        yylex() function.
yywrap()
        The function called when the scanner reaches the end of an
        input stream. The default definition simply returns 1, which
        causes the scanner in turn to return 0 (zero).
NOTES
Some trailing context patterns cannot be properly matched and generate
the warning message "Dangerous trailing context". These are patterns
where the ending of the first part of the rule matches the beginning of
the second part, such as zx*/xy*, where the x* matches the x at the
beginning of the trailing context.

For some trailing context rules, parts that are actually fixed length
are not recognized as such, leading to a loss of performance. In
particular, patterns using {n} (such as test{3}) are always considered
variable length.

Combining trailing context with the special | (vertical bar) action can
result in fixed trailing context being turned into the more expensive
variable trailing context. This happens in the following example:
        %%
        abc     |
        xyz/def

Use of unput() invalidates the contents of yytext and yyleng within the
current flex action.

Use of unput() to push back more text than was matched can result in
the pushed-back text matching a beginning-of-line (^) rule even though
it did not come at the beginning of the line.

Pattern matching of NULs is substantially slower than matching other
characters.

The flex command does not generate correct #line directives for code
internal to the scanner; thus, bugs in flex.skel yield invalid line
numbers.

Due to both buffering of input and read-ahead, you cannot intermix
calls to <stdio.h> routines, such as getchar(), with flex rules and
expect it to work. Call input() instead.

The total number of table entries listed by the -v option excludes the
table entries needed to determine what rule was matched. The number of
such entries is equal to the number of deterministic finite-state
automaton (DFA) states if the scanner does not use REJECT, and somewhat
greater than the number of states if it does.

REJECT cannot be used with the -f or -F options.
EXAMPLES
The following command processes the file lexcommands to produce the
scanner file lex.yy.c:
        flex lexcommands
This is then compiled and linked by the command:
        cc -o scanner lex.yy.c -ll
This produces a program named scanner. The scanner program converts
uppercase to lowercase letters, removes spaces at the end of a line,
and replaces multiple spaces with single spaces. The lexcommands file
contains:
        %%
        [A-Z]   putchar(tolower(yytext[0]));
        [ ]+$
        [ ]+    putchar(' ');
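For comparison, the same transformation can be written by hand in C,
which is useful as a reference when checking the scanner's output. This
is a sketch of the behavior the three rules describe, not the code flex
generates; transform_text() is a hypothetical name. It assumes, as the
$ operator implies, that a run of spaces counts as being at the end of
a line only when a newline follows it directly.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Writes the transformed form of in to out, which must have room for
 * as many bytes as in, including its terminating null byte (the result
 * is never longer than the input). Returns the length of the result.
 */
size_t transform_text(const char *in, char *out)
{
    size_t i = 0;
    size_t o = 0;

    assert(in != NULL && out != NULL);

    while (in[i] != '\0') {
        if (in[i] == ' ') {
            /* Consume the whole run of spaces, as [ ]+ would. */
            while (in[i] == ' ')
                i++;
            /* [ ]+$ : a run directly before a newline is dropped. */
            if (in[i] != '\n')
                out[o++] = ' ';   /* [ ]+ : otherwise emit one space */
        } else if (in[i] >= 'A' && in[i] <= 'Z') {
            /* [A-Z] : convert to lowercase */
            out[o++] = (char)tolower((unsigned char)in[i]);
            i++;
        } else {
            /* No rule matches: the default copies the character. */
            out[o++] = in[i++];
        }
    }
    out[o] = '\0';
    return o;
}
```

The order of the checks mirrors the order of the rules: the
end-of-line case is tested before the general run of spaces, just as
[ ]+$ precedes [ ]+ in the rules section.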
FILES
flex.skel
        Skeleton scanner.
lex.yy.c
        Generated scanner C source.
lex.backtrack
        Backtracking information generated from the -b option.
SEE ALSO
Commands: yacc(1), sed(1), awk(1)
Files: locale(4)