Elkhound: A GLR Parser Generator
and
Elsa: An Elkhound-based C++ Parser
Elkhound is a parser generator, similar to Bison. The
parsers it generates use the Generalized LR (GLR) parsing algorithm.
GLR works with any context-free grammar, whereas LR parsers
(such as Bison) require grammars to be LALR(1).
Parsing with arbitrary context-free grammars has two key advantages:
(1) unbounded lookahead, and (2) support for ambiguous grammars.
Unbounded lookahead is achieved by allowing multiple potential parses
to coexist for as long as necessary. Similarly, ambiguity is handled
by letting potential parses be coalesced, with special action taken
to handle the ambiguous region of the input. In general, by using the
GLR algorithm, Elkhound avoids the difficulty of trying to make a
language grammar fit the LALR(1) restrictions.
What's more, the Elkhound implementation of GLR is competitive with
good LR implementations (particularly Bison) for grammars and grammar
fragments that are LALR(1). Its high performance is a result of a
novel combination of GLR and LR parsing, wherein the parser decides on
a token-by-token basis which algorithm to use. Programmers can start
with a grammar that is convenient (but perhaps slower to parse with),
and gradually modify it to conform to LALR(1) in places for improved
performance. Grammars that are already LALR(1) will work in Elkhound
unmodified (though the input syntax is different), and with parsing
speeds within a few percent of Bison-generated parsers.
The reason I wrote Elkhound is to be able to write a C++ parser.
The parser is called Elsa, and is included in the distribution below.
Elkhound is written in C++, and can generate parsers written in either
C++ or Ocaml. Elsa is written entirely in C++, and parses C and C++
input code.
Documentation
- smbase, my utility library.
- ast, a system for making abstract syntax trees.
- elkhound, a GLR parser generator.
- elsa, a C/C++ parser.
Downloads
Elkhound and Elsa source code,
version 2005.08.22b.
Released under the
BSD License.
Note that you need
a C++ compiler such as GCC,
the lexical anayzer Flex,
and perl 5 for the build process
(these are already installed on most unix systems).
To see which systems this version has been tested on, please see the
success/failure matrix.
2006-12-09: More recent releases can be found on the
Oink page.
CVS repositories via
viewcvs:
smbase,
ast,
elkhound,
elsa.
(Please be gentle on this server!)
The very latest versions are available here, but these versions are not
as thoroughly tested (esp. on non-Linux systems) as the release tarballs.
Technical report describing Elkhound,
its implementation, programming interface, input syntax, and
performance measurements. Updated 2002-12-28.
2004-09-15: Note that there is a
bug in reduceViaPath in the
technical report.
How much C++ can Elsa parse?
(Info as of 2005-08-22.)
Elsa can parse most C++ "in the wild".
It has been tested with some notable large programs,
including Mozilla,
Qt,
ACE,
and itself. I have not tried parsing
KDE recently, so that's
the next major goal.
Elsa supports most of the GCC extensions, and several important
GCC bugs. Currently, gcc-3.4.x is the main compatibility target;
GCC bugs that are fixed in 3.4 are (in most cases) not supported.
It also supports a couple of MSVC bugs, but no extensions (other than
those that overlap with GCC extensions), and no bugs/extensions for
other compilers. Supported bugs are documented in
sources/elsa/cc_lang.h.
Elsa has recently gained the ability to parse all of the gcc-3.4.3
standard library headers (except for valarray, which uses
template template parameters). Previously, the recommended practice
was to
preprocess your program using gcc-2.95.3,
but it now should
work to use the 3.x headers, especially 3.4.3 or later.
Unfortunately, the headers that come with gcc-3.3.x and earlier contain
bugs, and Elsa does not currently emulate the gcc bugs that tolerate them.
(Note that the 3.x headers are much more complicated than the
2.x headers, so
they take significantly longer to parse and do template instantiation.)
I have not yet tried the 4.x headers.
The one major feature that is not implemented at all is template
template parameters and arguments (passing entire, uninstantiated
templates as arguments to other templates). The only place I have
seen them used is in the valarray header; let me know if
this feature is a priority to you.
Additionally, there are many known "minor" bugs. See
elsa/regrtest
for details. The known failures are mostly pretty esoteric; in
a few cases, they are (valid) inputs that no front end I know of can handle.
If you have a program you want to parse, and Elsa fails to, then it is
likely that a small "point fix" can be found quickly to solve the
problem. Send me a preprocessed .i file if you want help. Even better,
use Delta to minimize the
input (such that, say, gcc-3.4 accepts it but Elsa does not). Sending me
minimized test cases that cause Elsa to fail is the among the best ways
to help me make Elsa better!
How much C can Elsa parse?
Elsa supports parsing C by basically parsing it as C++ and then
applying looser typechecking rules (the details are more complicated,
but that is the basic approach). This support is enabled with the
"-tr c_lang" argument to ccparse.
In C mode, Elsa can parse most C programs, including the Linux kernel
(our highest-priority C program). It handles most gcc extensions,
including K&R function definitions and the "implicit int" rule.
There is a good chance it will parse your C program.
However, since Elsa still applies C++ rules in some places in C mode,
Elsa may reject valid but "funky" C programs. Support for the looser
C rules is being gradually added, and is usually easy to add; let me
know if you want help parsing some C program.
Relationship between Elsa, Elkhound, and CIL; and Ocaml vs C++
CIL, the
C Intermediate Language, is a C front-end written in
Ocaml.
George Necula,
Wes Weimer and I wrote
CIL as infrastructure for the
CCured
("see-cured") project, which is a source-to-source translator for C that
adds run-time checks for memory safety. CCured is also written in Ocaml.
Elsa and Elkhound are, at the moment, completely separate from CIL.
I wrote them because our experiments extending CIL to handle C++ were
unsuccessful. I chose to write them in C++ instead of Ocaml because
I prefer C++; the rest of the group prefers Ocaml, which is why CIL and
CCured use that language instead.
Long term, I would love to find a good way to merge Elsa and CIL. At
the very least it should be possible to use Elsa to parse C++ into an
AST, and then feed that to CIL so those that want to can write their
analysis in Ocaml and the rest of the CIL infrastructure. Even better,
I would like to make it possible to move ASTs back and forth among
several tools (possibly written in different languages), retaining not
just the AST forms but the annotations computed by prior analyses.
There is ongoing effort to export the Elsa AST as an XML document;
we expect to be able to advertise this in the next public release.
Contact
Are you using Elkhound? Do you want to be notified of new
releases? Drop me a line:
smcpeak cs berkeley edu
@ . .
If you want to report a bug in Elsa, please see the
bug reporting guidelines for Elsa.
2006-12-09: It has been quite some time since I have had a lot
of time to work on Elsa. The current version of Elsa is being hosted
at www.cubewano.org by
Karl Chen, and he and Daniel Wilkerson are doing most of the
maintenance. The most active page is the
Oink page; Oink is a project
that includes Elsa. I plan to move the Elsa pages to someplace else
reasonably soon.
Authors
- Original author, primary maintainer: Scott McPeak
- Major contributions: Daniel Wilkerson
- Patches/fixes:
- Wes Weimer
- Ben Liblit
- Jeff Foster
- Jim Nichols
- Bill McCloskey
- Bug reports:
- Altac "Emmanuel" Edena
- Karl Chen
(See also Elsa Contributors.)
Some other projects using Elkhound:
Changes
Major changes in version 2005.08.22b
(matrix):
- The only change is the timestamps on smbase/{o,}bjlist.h.
In the previous version's tarball,
they were set as older than smbase/xobjlist.h, causing people to need
to have M4, which I want to avoid.
Major changes in version 2005.08.22
(matrix):
- XML serialization is now functional! See
elsa/doc/serialization.txt
for more details.
- Rewrote type matcher, creating mtype module, which now handles
prior responsibilities of the old type matcher and the type equality checker.
- Fixed many (~80) bugs in a push to get Debian to go through Elsa.
- GNU __attribute__s are now retained in some cases (previously they
had been parsed but discarded), and infrastructure is in place to retain them
everywhere.
- Began writing
C++ Entities and Relationships,
an attempt to semi-formalize key static semantic concepts. It's still
evolving, but already a source of significant insights.
Major changes in version 2005.05.28
(matrix):
- Plenty of bug fixes.
- dsw: Separation of Types and abstract values.
- Added support for GNU __complex__ and C99 _Complex
and _Imaginary.
- Pushed ACE version
5.3.1 through Elsa.
- Pushed the remaining gcc-3.4.3 headers (except for valarray)
through Elsa. Whew!
Major changes in version 2005.05.03
(matrix):
- Fix for Redhat's flex-2.5.4a-29.
- Support for mingw32 cross-compile (just in smbase for now).
- Some MSVC fixes/hacks.
- Improved parsing of embedded ML reduction action code fragments.
- Added support for matching dependent qualified types.
- Numerous (>50) fixes for bugs reported by Karl Chen.
- dsw: Some refactoring of relationship between visitors and template
instantiations.
- dsw: Removed the 'cloneType' calls, and the 'loc' parameter on type
factory methods.
Major changes in version 2005.03.03
(matrix):
- Lots of bug fixes, thanks to reports sent by Altac Edena.
- Rewrote name lookup.
- Began process of evolving string class towards compatibility with std::string.
- (Elkhound) Fixed a bug in precedence resolution of reduce/reduce conflicts.
Major changes in version 2005.02.12
(matrix):
- Implemented argument-dependent name lookup as specified in section
3.4.2 of the standard.
- dsw: Added canonical insertion of "this->" where otherwise
implicit.
- Numerous fixes for cygwin when it is set to use DOS line endings.
Major changes in version 2004.11.19
(matrix):
- None; just renamed the tarball to coincide with the release of
Oink.
Major changes in version 2004.11.13
(matrix):
- Fixed numerous bugs. Mozilla once again works, as does Qt and a few other
packages. Also, confirmed Elsa can parse itself.
- Added support for incomplete enums
in gcc mode.
- Added info about declarator parsing.
- Operator overload resolution is now quite stable; turned it on by default.
- Implemented checking of arguments vs. parameters, to avoid the phenomenon
where mis-computed types only show up in complicated overloading scenarios.
- dsw: Added support for C99 hex floating literals.
- Implemented proper type computation for the ?: operator (cppstd 5.16).
- Changed default handling of operator= to match what gcc and icc do, despite
that being quite different from what is specified in cppstd.
- Overload resolution now handles sets that contain both static and nonstatic
member functions.
- Ambiguity resolution now handles an arbitrary number of alternatives
at a single AST node.
- Moved the K&R support from Oink into Elsa.
Major changes in version 2004.08.21
(matrix):
- Pulled the flex output back out, since it is dependent on the set of
active extension modules. I bet I broke flex-2.5.31 support again...
- Split out ReferenceType from PointerType,
and D_reference from D_pointer.
- dsw: Implemented missing conditional second operand (GNU extension).
- dsw: Nested functions implemented (GNU extension).
- Overload resolution now defaults to being enabled.
- Massive rewrite of template instantiation internals.
- Implemented explicit template instantiation.
- dsw: Elaborated "enum" lookups now look in base classes too.
- dsw: Switched to using a better hash function in some places.
- Added an Elsa tutorial.
- Partially implemented "dependent" vs. "non-dependent" lookup.
- Fixed lookup scope order w.r.t. 14.6.1.
- Binary operator overload resolution now stores the proper
result type in E_binary even for builtin candidates.
- Much better __attribute__ support.
- Lots of smaller bugfixes.
Major changes in version 2004.02.13
(matrix):
- I now distribute the output of running flex and bison so that users
do not have to have either, or the same version I have. Among other
things, this solves the incompatibility with flex-2.5.31.
Major changes in version 2004.02.07
(matrix):
- fixed several portability bugs, including the Solaris crash in srcloc
Major changes in version 2004.02.06
(matrix):
- (Elsa) overload resolution is implemented and used to resolve calls
- (Elsa) operator overload resolution is performed, and syntax rewritten
to clarify which function is called, in most contexts
- (Elsa) Daniel Wilkerson added an optional elaboration pass to make
explicit otherwise implicit calls to constructors, destructors,
and to add definitions of compiler-supplied functions (e.g.
copy ctor).
- (Elsa) implemented namespaces
- (Elsa) factor E_this out of E_variable; clarified other 'this' issues
- (Elsa) annotate Declarators with their syntactic context, as they appear
in many places with substantially varying meanings
- (Elsa) bugfix: handle ambiguity: "if (C * c = 5) ..."
- (Elsa) bugfix: correct scoping for "x" in "if (...) int x = ...;"
- (Elkhound) Ocaml backend (see elkhound/ocaml directory)
- (Elkhound) misc. portability fixes
- (Elkhound) support for "subset directives", which can be used to implement
scannerless parsers (TODO: write up how to use it)
- (astgen) member modifiers like "field" and "owner"
- (astgen) handle enums in .ast file similarly to classes
Major changes in version 2003.08.16:
- (Elkhound) Created a
tutorial to help
get started with Elkhound. It guides the new user from grammar
creation through run-time disambiguation.
- (Elsa) C++ operator overloading, including some
gymnastics to handle built-in candidates
(see convertibility.txt),
is finished.
It's still not enabled by default, however, because of some unfinished
parts of overloading generally.
Major changes in version 2003.07.22:
- Incorporated cygwin fixes from Jim Nichols.
- (Elkhound) Added very rudimentary error diagnosis: at a parse
error, the last "parser" to die will print the list of tokens that
would have allowed it to continue (the expected tokens).
- (Elsa) Many minor bug fixes.
- (Elsa) Partial implementation of template instantiation.
- (Elsa) Implemented modules to do standard conversions, implicit
conversions, and overload resolution. Nothing calls into these
modules, yet, except some testing infrastructure.
- (Elsa) Initial operator overloading implementation: when a call to
an operator function is detected, the AST is rewritten to be as if
the user used the explicit function call notation. (So far only
binary operators are implemented.)
- (Elsa) Added some support for GNU and C99 extensions: attributes,
inline assembly, designated initializers. It's now close to being
able to parse all of Linux.
- (Elsa) Added a module to compute the control flow graph among the
statements within a function body.
Major changes in version 2003.01.10:
- Portability fixes in the build process.
Major changes in version 2003.01.06:
- Lots of small enhancements and fixes to the C++ parser. It's almost
as fast as g++ now.
- Added support for parsing C too.
- Rewrote the lexer (the old one was slow and stupid).
- Renamed the distribution "Elsa".
- Rewrote configure scripts in perl.
- Implemented extension modules for lexer, parser, and AST system.
- Bunches of other good stuff I can't remember right now.
Major changes in version 2002.10.14:
- Finished the C++ parser. The only major missing feature is namespaces.
The parser test program is called cc/ccparse.
- Fixed a bug where keep() wasn't being called in the deterministic
(ordinary LR) parser core.
Major changes in version 2002.10.04:
- Added the beginnings of a C++ parser.
- Inverted the order of precedence specifications.
- The start rule and production are now implicit and automatic.
- Renamed gramanl to elkhound,
libglr.a to libelkhound.a.
Major changes in version 2002.09.16:
- Separated most of the testing and example grammars into their
own subdirectories, to simplify the toplevel parsgen/ organization.
- Removed a bunch of files that are no longer used.
- Created a grammar specifically as an example to learn from and
modify:
parsgen/examples/arith
.
- A few performance tweaks.
Original public release: 2002.08.12.
Related Links
I've collected here some links to related projects.
I don't claim these lists are exhaustive, but I'm happy to add links
you want to send me. For parser generators, I include both the
underlying parsing algorithm, and the license under which it is
available. (GPL=GNU
General Public License, LGPL=GNU Lesser General Public
License, BSD=BSD
License)
Other C++ Parsers. Here are some alternatives to Elsa. The main
advantages I would claim for Elsa are ease of adding language extensions,
a common activity in a research setting, and reasonably complete support
for the C++ language, especially templates. (Note that "instantiating"
templates is a necessary part of parsing code that uses templates.)
- The GNU Compiler Collection, aka GCC.
g++ used a Bison grammar prior to gcc-3.4. Starting with gcc-3.4,
it uses a hand-written recursive-descent parser. Its parser is more
mature than Elsa's. However, gcc is a
compiler, and its intermediate language may not be ideal for analyses
other than code generation. (GPL)
- GCC_XML is an
extension to GCC to have it emit some of the GCC AST as XML. Currently
it does not emit information about function bodies; it records
global declarations, struct contents, etc. (BSD)
- The Open C++ parser
was designed to allow run-time extensions, sort of like browser plugins.
It uses a hand-written recursive-descent parser. Its handling of
templates is currently (2005-06-03) less capable than Elsa's.
(Custom
license, similar to BSD.)
- Keystone
contains a parser written in BtYacc.
It does not do template instantiation. However, the page has some interesting
papers on the implementation of name lookup. (GPL)
- Edison Design Group licenses several
compiler front ends, including a hand-written recursive-descent C++ parser.
The EDG parser is very robust, and is the only front end I know of that
implements the entire C++ language, including the export keyword.
Its only drawbacks are the commercial license, which I find to be an
impediment to research impact, and the difficulty of adding and composing
extensions, since the parser specification is imperative code instead of
a grammar. (Commercial, but academic licensing is possible.)
- John Lilley's PCCTS-based C++ Parser
contains a C++ grammar for PCCTS (predecessor of
ANTLR). It does not appear to have
been updated since 1997. (Public domain.)
- The Eclipse IDE contains a C++ parser.
However, it is just doing a heuristic best effort without access to
crucial contextual information from headers, and so will mis-parse some
syntax. (Custom license,
similar to BSD. OSI-approved version
here.)
- KDevelop also contains a C++ parser.
I have not done much investigation of its capabilities. (GPL)
- OpenWatcom is an open-source version
of the Watcom C++ compiler. The parser is hand-written, recursive-descent.
(Custom OSI-approved
license.)
- Ultimate++ is Windows and Linux
IDE for C++. It contains a hand-written recursive-descent C++ parser,
comparable to Eclipse's: simple and heuristic, no template instantiation.
(Custom, vaguely BSD-ish license.)
- The FOG -
Flexible Object Generator contains a metacompiler for C++. It includes
a C++ parser based on a backtracking variant of Bison with ad-hoc ambiguity
resolution, some of which is deferred until semantic analysis (similar to
Elsa). I have not attempted to evaluate its completeness.
Edward
Willink's PhD thesis describes the design of the parser, and includes
an interesting discussion of the difficulties of parsing C++ and the resolution
techniques used in FOG. (License unknown.)
- Synopsis contains a
hand-written parser based on that of OpenC++. (LGPL)
GLR Parser generators. These are all of the available
systems I know of that use the GLR algorithm.
- The ASF+SDF
Meta-Environment integrates lexical, syntactic, and semantic analysis
into a single environment and framework. It uses a scannerless GLR
parser. (LGPL)
- Tom:
C Implementation of the Tomita (aka GLR) parsing algorithm. This
implementation just prints the parse tree and exits. (Public
domain)
- Harmonia
is "an open, extensible framework for constructing interactive,
language-aware programming tools." It uses an incremental GLR
parser. (Binary-only release at this time.)
- Elkhound
is a GLR parser generator. It tries to be as fast as Bison for
LALR(1) grammar fragments while also competitive with other GLR
implementations for nondeterministic portions of input. (BSD)
- Bison,
as of version 1.50, now includes a GLR parser, implemented
by Paul Hilfinger.
You request it with the %glr-parser directive, and further
documenation is in the info files.
Among my TODOs is to do some experimentation with this implementation.
(GPL)
- DParser
uses a scannerless GLR implementation. It supports user-specified
reduction actions, and includes an example grammar for C. I haven't
examined it in detail. (BSD)
- Fjavac
is a Java parser written in Ocaml using the GLR algorithm with
extensions for ordered productions (for certain kinds of
disambiguation specifications). It is part of a larger project
to formally specify Java. (License unspecified)
- JBP,
the Java Boolean Parser, is a parser generator for Java (the
generated parser is Java source code) written in Java. It
uses a scannerless GLR implementation with support for
boolean grammars.
The parser generates a parse tree, rather than supporting arbitrary
user-defined reduction actions. (BSD)
Parser generators based on algorithms other than GLR but which
support multiple tokens of lookahead (or some other way of handling
nondeterministic grammars):
- Accent is parser generator
based on a combination of the Earley and LL parsing algorithms. (GPL)
- BtYacc (backtracking
YACC) is based on YACC, but adds support for backtracking. Thus,
it allows for unbounded lookahead, but does not allow ambiguity to
be retained after parsing. (Public domain)
- ANTLR is an LL(k) parser generator
with built-in lexical analyzer. (Public domain)
- PRECC (PREttier Compiler
Compiler) is an LL(infinity) parser generator. (Custom license;
source is available; see its license.doc file.)
Parser generators that use deterministic algorithms (e.g. LALR(1) or
LL(1)). Since there are many such generators available, I attempt
here only to list those that are popular, actively maintained, or
otherwise relevant to Elkhound or the compiler community at large.
Also, I'm only listing those with C language interfaces (more or less
arbitrary decision, to cut down on the size).
- Bison
is a very popular LALR(1) parser generator. (GPL)
- YACC (Yet Another
Compiler-Compiler) was among the first widely-used parser generators,
in part because it was distributed with early versions of UNIX. It
generates LALR(1) parsers. (Availability/license unknown)
- AnaGram is an LALR(1)
parser generator with some support for built-in lexical analysis.
It also has several accompanying tools for grammar debugging.
(Commercial)
- ProGrammar
is another parser generator with accompanying tools to help
development. I can't tell from their website what parsing
algorithm it uses. (Commerical)
- Lemon is an LALR(1)
parser generator that claims to be faster than Bison. It
supports "nonterminal destructors", which are similar to
Elkhound's del() actions. (GPL)
Collections of tools and/or links:
Acknowledgements
This work is based upon work supported in part by the
National Science Foundation
under Grants No. CCR-9875171, CCR-0085949, CCR-0081588,
CCR-0085949, CCR-0326577 and CCR-0234689;
by the
National Aeronautics and Space Administration
under Grant No. NNA04CI57A;
and gifts from
Microsoft Research.
Any opinions, findings, and
conclusions or recommendations expressed in this material are those
of the author(s) and do not necessarily reflect the views of the
National Science Foundation or the other sponsors.
This support has been made available through, and supplemented by, the
University of California, Berkeley,
the
Open Source Quality Project and
my research advisor and collaborator,
George Necula.
Without this support, this work would not have been possible.