Elkhound Algorithm
This page describes some of the details of the variant of GLR that
Elkhound uses, placing it in context with some of the other GLR
variants that have been proposed.
Goals
First, the primary goals of Elkhound:
- It must have the capability to execute arbitrary user-provided
reduction actions. Building a parse tree isn't good enough, it
takes too much time and space.
- Reductions and merges should be performed bottom-up (unless the
grammar is cyclic).
- It should be competitive with Bison for LALR (fragements of)
grammars, and degrade gracefully from there. On the scale of
grammar nondeterminism, from none (LALR) to some to lots, "some"
is the niche Elkhound is going after.
These goals are driven by Elkhound's primary application, the
Elsa C++ Parser. In essence, Elkhound came about because I wanted
to apply automatic parsing technology to parsing C++, but found
exiting systems inadequate.
History
The approximate algorithm descendency sequence:
- Bernard Lang is typically credited with the original GLR idea:
Deterministic Techniques for Efficient Non-deterministic Parsers.
Bernard Lang.
Automata, Languages and Programming, Springer, 1974.
- Later, Tomita published the algorithm with the intent of using it
for natural language processing. He popularized the term
"Generalized LR Parsing", or GLR.
Efficient Parsing for Natural Language.
Masaru Tomita.
Int. Series in Engineering and Computer Science, Kluwer, 1985.
- Tomita's algorithm fails for some grammars with epsilon rules.
Farshi proposed a fix involving doing a GSS search after some
reductions.
GLR Parsing for epsilon-grammars.
Rahman Nozohoor-Farshi.
In Generalized LR Parsing, Kluwer, 1991.
- Rekers adapted Farshi's solution for use in
ASF+SDF.
Rekers added
parse table construction to what was otherwise a recognizer.
His algorithm was what I based the original Elkhound
implementation on.
Parser Generation for Interactive Environments.
Jan Rekers.
PhD thesis, University of Amsterdam, 1992.
- When straightforwardly modified to execute user actions, the Rekers
algorithm does not always do merges and reductions bottom-up, which makes
using it in a parser much harder. Further,
it is slower than Bison (by about a factor of 10) even on LALR grammars.
George Necula and I remedied these deficiencies while building the Elkhound GLR parser
generator.
Elkhound: A Fast, Practical GLR Parser Generator.
Scott McPeak and George C. Necula.
In Proceedings of Conference on Compiler Constructor (CC04), 2004.
Right Nulled GLR
At the CC04 conference I became acquainted with Adrian
Johnstone and his work. He and his coathors have not only more
thoroughly documented the history of GLR, but also proposed a novel
alternative solution to the problem Farshi originally addressed,
which they call "Right Nulled GLR Parsing".
The basic idea of Right Nulled GLR is, rather than re-examining work
already done to check for newly exposed reduction opportunities (as
Farshi does), do those reductions that would be found by the search
earlier in the parse, namely as soon as the rule in question only
has nullable components remaining to recognize.
However, while this approach is certainly appealing, I still have
questions about exactly how to adapt it to use user-defined reduction
actions. At some point I want to implement the right-nulled variant
in Elkhound to experiment more with it, but haven't gotten around to it.
Algorithm Features
The
Technical Report
has most of the details, so I'll just point out some key distinguishing
characteristics here.
- Hybrid LR/GLR.
The Elkhound Graph-Structured Stack carries
sufficient information (in the form of the "deterministic depth") to
tell when a given parse action (shift or reduce) can be performed
using the ordinary LR algorithm. Essentially, if there isn't any
nondeterminism near the stack top, LR can be used. LR actions are
much faster to execute, so this leads to a big performance win (a
factor of 5 to 10) when the input and grammar are mostly deterministic.
- User-defined Actions. In addition to the usual
reduction actions (such as with Bison), Elkhound exposes actions
to create and destroy semantic values (dup and del), and an action
to merge ambiguous regions (merge). The user can then do whatever
is desired in these actions, typically building an abstract syntax
tree. In contrast, most other GLR implementations build a parse
tree, which is quite expensive. (Parse trees tend to be at least
10 times larger than abstract syntax trees.)
- Reduction Worklist. As alluded to above, the
Rekers algorithm will sometimes mix up the order of reduces and
merges. The problem is with the granularity of the GSS Node
worklist; it forces certain actions (all those associated with the
node) to happen together, even when something else should happen
in between. Elkhound solves this by maintaining a finer-grained
worklist of reduction opportunties, and keeps this worklist sorted
in a specific order, such that a bottom-up reduction/merge order
will always be used when possible. (With a cyclic grammar, there
may be a cycle in the reduction order, precluding bottom-up
execution. Elkhound reports cyclic grammars to the user, but then
goes ahead and parses with them.)
- Reference-Counted GSS. This is mostly a minor
detail, but using reference counting in the GSS produces much
better locality than garbage collection alone (and manual
deallocation is impossible due to the nature of the algorithm).
- C++ and OCaml Interfaces. Though not a feature of
the parsing algorithm per se, a nice feature is the ability to let
the user write actions in either C++ or OCaml. There are actually
two parser core impementations, a C++ implementation and an OCaml
implementation, and the user simply links Elkhound's output with
the appropriate parser core.