Parse::Eyapp::eyappintro(3)   User Contributed Perl Documentation

NAME
Parse::Eyapp::eyappintro - An introduction to Parse::Eyapp
SYNOPSIS
# File 'Postfix.eyp': translates infix expressions to postfix
# Compile it with: eyapp -o calc.pl -C Postfix.eyp
# Execution: ./calc.pl -c 'a = 2*3+b'
%token NUM = /([0-9]+(?:\.[0-9]+)?)/
%token VAR = /([A-Za-z][A-Za-z0-9_]*)/
%right '='
%left '-' '+'
%left '*' '/'
%left NEG
%defaultaction { "$left $right $op"; }
%%
line: $exp { print "$exp\n" }
;
exp: $NUM { $NUM }
| $VAR { $VAR }
| VAR.left '='.op exp.right
| exp.left '+'.op exp.right
| exp.left '-'.op exp.right
| exp.left '*'.op exp.right
| exp.left '/'.op exp.right
| '-' $exp %prec NEG { "$exp NEG" }
| '(' $exp ')' { $exp }
;
%%
INTRODUCTION TO PARSING WITH Parse::Eyapp
Introduction
Parsing is the activity of producing a syntax tree from an input
stream. The program in the SYNOPSIS section is an example of a
translation scheme. It translates infix arithmetic expressions like
a = 2 * 3 + 4 * b
into postfix expressions like
a 2 3 * 4 b * + =
The program contains a context free eyapp grammar defining the language
of arithmetic infix expressions. A context free grammar is a
mathematical device to define languages. To better see the grammar of
the example above we have to eliminate the semantic actions (the
semantic actions are the sections delimited by curly brackets
containing Perl code). This can be done by calling "eyapp" with the
options "-vc":
$ eyapp -vc Postfix.eyp
%token NUM =/([0-9]+(?:\.[0-9]+)?)/
%token VAR =/([A-Za-z][A-Za-z0-9_]*)/
%right '='
%left '-' '+'
%left '*' '/'
%left NEG
%%
line:
exp
;
exp:
NUM
| VAR
| VAR '=' exp
| exp '+' exp
| exp '-' exp
| exp '*' exp
| exp '/' exp
| '-' exp %prec NEG
| '(' exp ')'
;
%%
A grammar generates a language. A grammar is defined by a set of
production rules. A production rule has two components: a left hand
side which is a syntactic variable or non terminal and a right hand
side which is a phrase made of syntactic variables and terminals. The
left hand side (lhs) and the right hand side (rhs) are usually
separated by an arrow, as in:
exp -> VAR = exp
A note: when you see a production rule like:
line: exp <+ ';'>
it is not really a production rule but an abbreviation for two
productions. It stands for:
line : exp
| line ';' exp
;
A terminal or token never appears on the left hand side of a production
rule.
Ambiguity
The phrases of the language are those obtained by successively
applying the production rules of the grammar until no more rules can
be applied.
The successive substitutions must start from the "start" symbol of the
grammar ("line" in the example). Such a legal sequence of substitutions
is known as a derivation. The following is an example of a legal
derivation (the big arrow "=>" is read derives):
line => exp => VAR = exp => VAR = exp + exp => VAR = exp + NUM => VAR = VAR + NUM
thus the phrase "VAR = VAR + NUM" belongs to the language generated by
the former grammar. A derivation like this can be seen as a tree. For
instance, the former derivation is equivalent to (carries the same
information as) the following tree:
+----+
|line|
+----+
|
+---+
|exp|
+---+
.-----.-----^-----.
+---+ +---+ +---+
|VAR| | = | |exp|
+---+ +---+ +---+
.-----+-----.
+---+ +---+ +---+
|exp| | + | |exp|
+---+ +---+ +---+
| |
+---+ +---+
|VAR| |NUM|
+---+ +---+
which can be written more succinctly:
line(exp(VAR, '=', exp(exp(VAR), '+', exp(NUM))))
or even more briefly:
VAR = (VAR + NUM)
Such a tree is called a syntax tree for the input "VAR = VAR + NUM". A
grammar is said to be ambiguous if there are phrases in the generated
language that have more than one syntax tree. The grammar in the
synopsis example is ambiguous. Here is an alternative tree for the same
phrase "VAR = VAR + NUM":
+----+
|line|
+----+
|
+---+
|exp|
+---+
.----------^-----.-----.
+---+ +---+ +---+
|exp| | + | |exp|
+---+ +---+ +---+
.-----+-----. |
+---+ +---+ +---+ +---+
|VAR| | = | |exp| |NUM|
+---+ +---+ +---+ +---+
|
+---+
|VAR|
+---+
or
line(exp(exp(VAR, '=', exp(VAR)), '+', exp(NUM)))
or
(VAR = VAR) + NUM
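The practical consequence of the two trees can be checked with a few
lines of plain Perl. The nested-arrayref tree representation and the
postfix() helper below are only illustrative sketches (they are not
Parse::Eyapp data structures), flattening each tree the way the
translation scheme in the SYNOPSIS would:

```perl
use strict; use warnings;

# Each tree is a nested arrayref [ left, op, right ]; leaves are the
# token strings themselves.
my $right_tree = [ 'a', '=', [ 'b', '+', '4' ] ];   # a = (b + 4)
my $left_tree  = [ [ 'a', '=', 'b' ], '+', '4' ];   # (a = b) + 4

sub postfix {
    my $t = shift;
    return $t unless ref $t;                # leaf: a VAR or a NUM
    my ($l, $op, $r) = @$t;
    return postfix($l) . ' ' . postfix($r) . " $op";
}

print postfix($right_tree), "\n";   # prints "a b 4 + ="
print postfix($left_tree),  "\n";   # prints "a b = 4 +"
```

The two trees produce different postfix strings, which is the semantic
ambiguity discussed in the next sections.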
Semantic Actions and Attributes
"Parse::Eyapp" analyzes your grammar and produces a LALR parser.
Actually the SYNOPSIS example is more than a context free grammar: it
is a translation scheme. A translation scheme is a context free
grammar where the right hand sides of the productions have been
augmented with semantic actions (i.e. with chunks of Perl code):
A -> alpha { action(@_) } beta
The analyzer generated by Eyapp executes "{ action(@_) }" after all the
semantic actions associated with "alpha" have been executed and before
the execution of any of the semantic actions associated with "beta".
In a translation scheme each symbol occurrence has an associated
attribute. The embedded actions modify the attributes associated with
the symbols of the grammar:
A -> alpha { action(@_) } beta
Each symbol on the right hand side of a production rule has an
associated scalar attribute. In "eyapp" the attributes of the symbols
to the left of "action" are passed as arguments to "action" (in the
example, those of "alpha"). These arguments are preceded by a
reference to the syntax analyzer object. Therefore, you can access
the attributes associated with the first, second, etc. symbols in the
right hand side using the notation:
$_[1], $_[2], ...
However it is better to refer to the attributes by names. This is the
purpose of the dot and dollar notations as in:
exp: $NUM { $NUM }
| $VAR { $VAR }
| VAR.left '='.op exp.right
| exp.left '+'.op exp.right
| exp.left '-'.op exp.right
| exp.left '*'.op exp.right
| exp.left '/'.op exp.right
| '-' $exp %prec NEG { "$exp NEG" }
| '(' $exp ')' { $exp }
;
By prefixing the symbol "NUM" by a "$" we can refer to the associated
attribute inside the semantic action as $NUM:
exp: $NUM { $NUM }
By postfixing the two appearances of "exp" with ".left" and ".right"
and the appearance of '+' with ".op" we can refer to the associated
attributes as $left, $right and $op instead of $_[1], $_[3] and $_[2]:
%defaultaction { "$left $right $op"; }
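As a sketch of what such an action receives at run time, the following
stand-alone Perl snippet mimics the call; the subroutine name
plus_action is made up (in a real eyapp grammar the action body is
anonymous), but the argument convention shown is the one described
above:

```perl
use strict; use warnings;

# Hypothetical stand-alone version of the action for
#   exp: exp.left '+'.op exp.right
# The generated parser passes the parser object first, then the
# attributes of the right hand side symbols, in order.
sub plus_action {
    my ($parser, $left, $op, $right) = @_;   # $_[0] is the parser
    return "$left $right $op";               # postfix order
}

print plus_action(undef, '2', '+', '3'), "\n";   # prints "2 3 +"
```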
There is no way inside an ordinary "eyapp" program for an intermediate
"action" to access the attributes of the symbols on its right, i.e.
those associated with the symbols of "beta". This restriction is lifted
if you use the %metatree directive to build a full translation scheme.
See Parse::Eyapp::translationschemestut to learn more about full
translation schemes.
Actions on the right hand side count as symbols and so they can be
referenced by their positional argument in later actions in the same
production rule. For intermediate actions, the value returned by the
"action" is the attribute associated with such action. For an action at
the end of the rule:
A -> alpha { lastaction(@_) }
the returned value constitutes the attribute of the left hand side of
the rule (the attribute of "A" in this case). The action at the end of
the right hand side is called the action associated with the production
rule. When no explicit action has been associated with a production
rule the default action applies. In "Parse::Eyapp" the programmer can
define the default action through the %defaultaction directive:
%defaultaction { "$left $right $op"; }
Actually, intermediate actions are implemented via a trick. When
"eyapp" sees an intermediate action like:
A -> alpha { action(@_) } beta
it creates a new auxiliary syntactic variable "Temp" and rewrites the
rule as "A -> alpha Temp beta":
Temp -> /* empty */ { action(@_) }
Solving Ambiguities via Precedence and Associativity Declarations
Notice that ambiguous grammars produce ambiguous translation schemes:
since a phrase may have two syntax trees there will be more than one
tree traversal and consequently more than one way to execute the
embedded semantic actions. Certainly, different execution orders will
usually produce different results. Thus, syntactic ambiguities
translate into semantic ambiguities. That is why it is so important to
resolve all the ambiguities and conflicts that may arise in our
grammars. This is the function of the %left and %right declarations on
the header section:
%right '='    # Lowest precedence
%left '-' '+' # + and - have higher precedence than =. Disambiguates a-b-c as (a-b)-c
%left '*' '/' # * and / have higher precedence than + and -. Disambiguates a/b/c as (a/b)/c
%left NEG     # Disambiguates -a-b as (-a)-b and not as -(a-b)
Priority can be assigned to tokens by using the %left and %right
declarations. Tokens declared in later lines have higher precedence
than tokens declared in earlier lines. The idea behind this notation
is this: any ambiguity can be seen as a parenthesizing problem. You
can parenthesize left (in the jargon this is called to reduce) or
parenthesize right (in the jargon, to shift). Recall the main points
of yacc-like parsers related to priorities:
· The directives
%left
%right
%nonassoc
can be used in the head section to declare the priority of a token
· The later the declaration line the higher the priority
· Tokens in the same line have the same priority. Ties will be solved
using the token associativity (whether they were declared %left or
%right)
· The precedence of a production rule (right hand side) is the
precedence of the last token in the right hand side
· If the parser emits a warning announcing a shift-reduce conflict or a
reduce-reduce conflict in your grammar, it likely means that your
grammar is ambiguous or not LALR. In such a case, recompile the
grammar with "eyapp -v" and carefully study the ".output" file
generated. Detect which token and which rules are involved in the
conflict.
· In a shift-reduce conflict the default action is to shift (i.e.
associate right). This action can be changed if the production and
the token involved have explicit priorities
· Most of the time the presence of a reduce-reduce conflict means that
your grammar is ambiguous. Rewrite your grammar. Alternatively, use
the %conflict and %PREC directives (see example
"debuggintut/pascalenumeratedvsrangesolvedviadyn.eyp"). The default
action is to reduce by the first production.
· If the precedence of the production rule is higher the shift-reduce
conflict is solved in favor of the reduction (i.e. associate left)
· If the precedence of the token is higher the shift-reduce conflict is
solved in favor of the shift (i.e. associate right).
· If the precedence of the token is the same as the precedence of the
rule, and it is %left, the shift-reduce conflict is solved in favor
of the reduction (i.e. associate left)
· If the precedence of the token is the same as the precedence of the
rule, and it is %right, the shift-reduce conflict is solved in favor
of the shift
· If the precedence of the token is the same as the precedence of the
rule, and it is %nonassoc, the presence of a shift-reduce conflict
means an error. This is used to describe operators, like the
operator ".LT." in FORTRAN, that may not associate with themselves.
That is, because
A .LT. B .LT. C
is invalid in FORTRAN, ".LT." would be described with the keyword
%nonassoc in eyapp.
· The default precedence of a production can be changed using the
"%prec TOKEN" directive. Now the rule has the precedence and
associativity of the specified "TOKEN".
Examples
· By giving token '+' more precedence than token '=' we solve the
ambiguity in "VAR = VAR + NUM" in favor of " VAR = (VAR + NUM)". The
conflict occurs between the productions
exp -> exp . '+' exp
exp -> VAR '=' exp .
where the dot marks the current position of the parser:
if I have seen "VAR '=' exp" and the next token is '+', I can either
associate left, i.e. reduce "VAR '=' exp" to "exp", or associate
right, that is, shift the '+' so as to reduce "exp '+' exp" to "exp"
later.
· How does it work when two tokens are declared in the same line?
Consider the phrase "NUM - NUM - NUM". It will be interpreted as
"(NUM - NUM) - NUM" if the token '-' is declared "%left '-'" and will
be interpreted as "NUM - (NUM - NUM)" if the token '-' is declared
"%right '-'". By saying '-' is left associative we are saying that,
between the two trees in dispute, we prefer the one that deepens to
the left:
+---+
|exp|
+---+
.--------^--.-----.
+---+ +---+ +---+
|exp| | - | |exp|
+---+ +---+ +---+
.-----+-----. |
+---+ +---+ +---+ +---+
|exp| | - | |exp| |NUM|
+---+ +---+ +---+ +---+
| |
+---+ +---+
|NUM| |NUM|
+---+ +---+
By saying '-' is right associative we are saying that, between the
two trees in dispute, we prefer the one that deepens to the right:
+---+
|exp|
+---+
.-----.-----^-----.
+---+ +---+ +---+
|exp| | - | |exp|
+---+ +---+ +---+
| .-----+-----.
+---+ +---+ +---+ +---+
|NUM| |exp| | - | |exp|
+---+ +---+ +---+ +---+
| |
+---+ +---+
|NUM| |NUM|
+---+ +---+
Since priority means earlier evaluation and the evaluation by eyapp
of semantic actions is bottom up, the deeper the associated subtree
the higher the priority.
· Consider now the phrase "-NUM-NUM". There are two interpretations:
one as "-(NUM-NUM)" and the other as "(-NUM)-NUM". The conflict
occurs between the productions
exp -> exp . '-' exp
exp -> '-' exp.
Both productions have the precedence of the token '-'. But we want
the interpretation "(-NUM)-NUM" to win. We achieve that by explicitly
changing the precedence of the unary minus production via the %prec
directive.
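The semantic effect of the two associativities can be mimicked in
plain Perl. The sketch below is only an illustration of the two
evaluation orders (it uses the core List::Util module, not eyapp):

```perl
use strict; use warnings;
use List::Util qw(reduce);

my @nums = (10, 4, 3);

# %left '-'  : evaluate as (10 - 4) - 3, folding from the left
my $left_assoc  = reduce { $a - $b } @nums;

# %right '-' : evaluate as 10 - (4 - 3), grouping from the right
my $right_assoc = $nums[0] - ($nums[1] - $nums[2]);

print "$left_assoc $right_assoc\n";   # prints "3 9"
```

The two results differ, which is exactly why an ambiguous grammar must
be disambiguated before its semantic actions can mean anything definite.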
Lexical Analysis
Parsers created by "eyapp" do not deal directly with the input. Instead
they expect the input to be processed by a lexical analyzer. The
lexical analyzer scans the input and produces the next token. A token
is a pair. The first component is the name of the token (like "NUM" or
"VAR") and the second is its attribute (i.e. the information associated
with the token, like that the value is 4 for a "NUM" or the identifier
is "temperature" for a "VAR"). Tokens are usually defined using regular
expressions. Thus the token "NUM" is characterized by
"/[0-9]+(?:\.[0-9]+)?/" and the token "VAR" by
"/[A-Za-z][A-Za-z0-9_]*/". The eyapp compiler automatically generates a
lexical analyzer from your token definitions. The tokens "NUM" and
"VAR" were defined using the %token directives:
%token NUM = /([0-9]+(?:\.[0-9]+)?)/
%token VAR = /([A-Za-z][A-Za-z0-9_]*)/
The order in which the tokens are defined is important. The input
will be matched against the regular expression for "NUM" before the
regular expression for "VAR" is tried, and all the literal tokens
that appear between quotes inside the body of the grammar, like '+'
or '-', are tried before any explicitly defined token.
You can, alternatively, define the lexical analyzer explicitly. There
are many ways to do it. Here is an example of a definition of a
lexical analyzer using the %lexer directive:
%lexer {
m{\G[ \t]*}gc;
m{\G(\n)+}gc and $self->tokenline($1 =~ tr/\n//);
m{\G([0-9]+(?:\.[0-9]+)?)}gc and return ('NUM', $1);
m{\G([A-Za-z_][A-Za-z0-9_]*)}gc and return ('VAR', $1);
m{\G(.)}gc and return ($1, $1);
}
The %lexer directive is followed by the code defining the lexical
analyzer. When called, the variable $_ is an alias of the input. The
input can also be set and accessed via the "input" method of the
$parser object.
To catch the next pattern we use the anchor "\G". The "\G" anchor
matches at the point where the previous "/g" match left off. Normally,
when a scalar "m{}g" match fails, the match position is reset and "\G"
will start matching at the beginning of the string. The "c" option
causes the match position to be retained following an unsuccessful
match. The pair "('', undef)", which signals the end of the input, is
automatically inserted by "eyapp".
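The same \G/gc technique can be tried outside eyapp. The following
self-contained sketch (the subroutine name next_token is made up)
tokenizes a string much like the %lexer body above; because $_[0]
aliases the caller's string, pos() is remembered on that string
between calls:

```perl
use strict; use warnings;

# Stand-alone sketch of a \G-anchored lexer.
sub next_token {
    for ($_[0]) {                                       # alias $_ to the input
        m{\G[ \t]*}gc;                                  # skip blanks
        m{\G([0-9]+(?:\.[0-9]+)?)}gc and return ('NUM', $1);
        m{\G([A-Za-z_][A-Za-z0-9_]*)}gc and return ('VAR', $1);
        m{\G(.)}gc and return ($1, $1);                 # literal token
        return ('', undef);                             # end of input
    }
}

my $input = 'a = 2*3';
my @names;
while (1) {
    my ($name, $attr) = next_token($input);
    last unless defined $attr;
    push @names, $name;
}
print "@names\n";   # prints "VAR = NUM * NUM"
```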
By default, the lexers generated by "eyapp" emit the end-of-input
token "('', undef)" when the end of the current string is reached. An
incremental lexer differs from this behavior: when the end is reached,
it reads more input from the current file, which was set by
$parser->YYInputFile
See the following variant of the synopsis example:
~/LEyapp/examples/eyappintro$ cat InputFromStream.eyp
%whites /([ \t]+)/
%token NUM = /([0-9]+(?:\.[0-9]+)?)/
%token VAR = /([A-Za-z][A-Za-z0-9_]*)/
%right '='
%left '-' '+'
%left '*' '/'
%left NEG
%defaultaction { "$_[1] $_[3] $_[2]" }
# example of incremental lexer
%incremental lexer 'Write an arithmetic expression: '
%%
input: {}
| input line {}
;
line: '\n' {}
| exp '\n' { print "$_[1]\n" }
| error '\n' {}
;
exp: NUM { $_[1] }
| VAR { $_[1] }
| VAR '=' exp
| exp '+' exp
| exp '-' exp
| exp '*' exp
| exp '/' exp
| '-' exp %prec NEG { "$_[2] NEG" }
| '(' exp ')' { $_[2] }
;
%%
Now, after the grammar is compiled
~/LEyapp/examples/eyappintro$ eyapp -C InputFromStream.eyp
When the generated modulino is executed, each time the end of the input
string is reached, it asks for more input until we press the end-of-
file ("^D" in Unix) key:
~/LEyapp/examples/eyappintro$ ./InputFromStream.pm -noslurp
Write an arithmetic expression: a=2+3*b
a 2 3 b * + =
Write an arithmetic expression: a=-b*2
a b NEG 2 * =
Write an arithmetic expression: ^D
~/LEyapp/examples/eyappintro$
The "main" and "error" subroutines
If you compile your grammar with option "-C", "eyapp" will insert a
line like this as the first line of the generated ".pm" file:
#!/usr/bin/perl
It will also append a line like this as the last line of the ".pm"
file:
unless (caller) { exit !__PACKAGE__->main(''); }
This allows the module to be used alternatively as a script (a
so-called modulino). Unless a "main" subroutine was defined by the
programmer, the one provided by Parse::Eyapp::Driver will be called.
The driver also provides a default subroutine for the handling of
error messages.
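The pattern can be seen in isolation in the following minimal sketch
(the package name My::Modulino is made up; the commented exit line is
the form eyapp actually appends):

```perl
package My::Modulino;   # hypothetical package name
use strict; use warnings;

sub main {
    my ($class, $arg) = @_;
    print "main called\n";
    return 1;   # success status
}

# caller() is false when the file is executed directly and true when
# it is loaded with 'use' or 'require', so main runs only as a script.
# The generated module uses: unless (caller) { exit !__PACKAGE__->main(''); }
unless (caller) {
    my $ok = __PACKAGE__->main('');
}
1;
```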
The default main accepts a few arguments from the command line. Here
are some:
· "-f filename" input from "filename"
· "-c 'string'" input from 'string'
· "-noslurp" when input is from "STDIN" don't wait for end of file
SEE ALSO
· The project home is at <http://code.google.com/p/parse-eyapp/>.
Use a subversion client to anonymously check out the latest project
source code:
svn checkout http://parse-eyapp.googlecode.com/svn/trunk/ parse-eyapp-read-only
· The tutorial Parsing Strings and Trees with "Parse::Eyapp" (An
Introduction to Compiler Construction in seven pages) in
<http://nereida.deioc.ull.es/~pl/eyapsimple/>
· Parse::Eyapp, Parse::Eyapp::eyapplanguageref,
Parse::Eyapp::debuggingtut, Parse::Eyapp::defaultactionsintro,
Parse::Eyapp::translationschemestut, Parse::Eyapp::Driver,
Parse::Eyapp::Node, Parse::Eyapp::YATW, Parse::Eyapp::Treeregexp,
Parse::Eyapp::Scope, Parse::Eyapp::Base,
Parse::Eyapp::datagenerationtut
· The pdf file in
<http://nereida.deioc.ull.es/~pl/perlexamples/languageintro.pdf>
· The pdf file in
<http://nereida.deioc.ull.es/~pl/perlexamples/debuggingtut.pdf>
· The pdf file in
<http://nereida.deioc.ull.es/~pl/perlexamples/eyapplanguageref.pdf>
· The pdf file in
<http://nereida.deioc.ull.es/~pl/perlexamples/Treeregexp.pdf>
· The pdf file in
<http://nereida.deioc.ull.es/~pl/perlexamples/Node.pdf>
· The pdf file in
<http://nereida.deioc.ull.es/~pl/perlexamples/YATW.pdf>
· The pdf file in
<http://nereida.deioc.ull.es/~pl/perlexamples/Eyapp.pdf>
· The pdf file in
<http://nereida.deioc.ull.es/~pl/perlexamples/Base.pdf>
· The pdf file in
<http://nereida.deioc.ull.es/~pl/perlexamples/translationschemestut.pdf>
· The pdf file in
<http://nereida.deioc.ull.es/~pl/perlexamples/treematchingtut.pdf>
· perldoc eyapp,
· perldoc treereg,
· perldoc vgg,
· The syntax highlighting file for vim at
<http://www.vim.org/scripts/script.php?script_id=2453> and
<http://nereida.deioc.ull.es/~vim/>
· Analisis Lexico y Sintactico (notes for a course in compiler
construction) by Casiano Rodriguez-Leon. Available at
<http://nereida.deioc.ull.es/~pl/perlexamples/>. It is the most
complete and reliable source for Parse::Eyapp. However, it is in
Spanish.
· Parse::Yapp,
· Man pages of yacc(1) and bison(1),
<http://www.delorie.com/gnu/docs/bison/bison.html>
· Language::AttributeGrammar
· Parse::RecDescent.
· HOP::Parser
· HOP::Lexer
· ocamlyacc tutorial at
<http://plus.kaist.ac.kr/~shoh/ocaml/ocamllex-ocamlyacc/ocamlyacc-tutorial/ocamlyacc-tutorial.html>
REFERENCES
· The classic Dragon book, Compilers: Principles, Techniques, and
Tools by Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman
(Addison-Wesley 1986)
· CS2121: The Implementation and Power of Programming Languages (See
<http://www.cs.man.ac.uk/~pjj>,
<http://www.cs.man.ac.uk/~pjj/complang/g2lr.html> and
<http://www.cs.man.ac.uk/~pjj/cs2121/ho/ho.html>) by Pete Jinks
CONTRIBUTORS
· Hal Finkel <http://www.halssoftware.com/>
· G. Williams <http://kasei.us/>
· Thomas L. Shinnick <http://search.cpan.org/~tshinnic/>
· Frank Leray
AUTHOR
Casiano Rodriguez-Leon (casiano@ull.es)
ACKNOWLEDGMENTS
This work has been supported by CEE (FEDER) and the Spanish Ministry of
Educacion y Ciencia through Plan Nacional I+D+I number
TIN2005-08818-C04-04 (ULL::OPLINK project <http://www.oplink.ull.es/>).
Support from Gobierno de Canarias was through GC02210601 (Grupos
Consolidados). The University of La Laguna has also supported my work
in many ways and for many years.
A large percentage of code is verbatim taken from Parse::Yapp 1.05.
The author of Parse::Yapp is Francois Desarmenien.
I wish to thank Francois Desarmenien for his Parse::Yapp module, to my
students at La Laguna and to the Perl Community. Thanks to the people
who have contributed to improve the module (see "CONTRIBUTORS" in
Parse::Eyapp). Thanks to Larry Wall for giving us Perl. Special
thanks to Juana.
LICENCE AND COPYRIGHT
Copyright (c) 2006-2008 Casiano Rodriguez-Leon (casiano@ull.es). All
rights reserved.
Parse::Yapp copyright is of Francois Desarmenien, all rights reserved.
1998-2001
These modules are free software; you can redistribute it and/or modify
it under the same terms as Perl itself. See perlartistic.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
perl v5.12.5 2011-02-16 Parse::Eyapp::eyappintro(3)