Elkhound Manual

This page describes the input format for grammars for the Elkhound parser generator, and the features of that generator.

If you'd like to look at a simple grammar while reading this description, see examples/arith/arith.gr, a parser for simple arithmetic expressions.

1. Lexical structure

The grammar file format is free-form, meaning that all whitespace is considered equivalent. In the C tradition, grouping is generally denoted by enclosing things in braces ("{" and "}"). Strings are enclosed in double-quotes ("").

Grammar files may include other grammar files, by writing include("other_file_name").

Comments can use the C++ "//" syntax or the C "/**/" syntax.

2. Context Class

The parser's action functions are all members of a C++ context class. As the grammar author, you must define the context class. The class is introduced with the "context_class" keyword, followed by ordinary C++ syntax for classes (ending with "};").
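
For example, a minimal context class might look like this (the class name and members are only illustrative; any ordinary C++ class body will do):

  context_class ParseEnv {
  public:
    int errorCount;                  // whatever state the action functions need
    ParseEnv() : errorCount(0) {}
  };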

3. Terminals

The user must declare all of the tokens, also called terminals. A block of terminal declarations looks like:
  terminals {
    0 : TOK_EOF;
    1 : TOK_NUMBER;              // no alias
    2 : TOK_PLUS     "+";        // alias is "+" (including quotes)
    3 : TOK_MINUS    "-";
    4 : TOK_TIMES    "*";
    5 : TOK_DIVIDE   "/";
    6 : TOK_LPAREN   "(";
    7 : TOK_RPAREN   ")";
  }

Each statement gives a unique numeric code (e.g. 3), a canonical name (e.g. TOK_MINUS), and an optional alias (e.g. "-"). Either the name or the alias may appear in the grammar productions, though the usual style is to use aliases for tokens that always have the same spelling (like "}"), and the name for others (like TOK_NUMBER).

Normally it is expected that the tokens will be described in their own file, and that the make-tok script will create the token list seen above.

3.1 Token Types

In addition to declaring the numeric codes and aliases of the tokens, the user must declare types for the semantic values of tokens, if those values are used by reduction actions (specifically, if their occurrence on a right-hand-side includes a label, denoted with a colon ":").

The syntax for declaring a token type is

  token(type) token_name;

or, if specifying terminal functions,

  token(type) token_name {
    terminal_functions
  }
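
For example, a grammar whose TOK_NUMBER tokens carry int semantic values (as in the arithmetic example) could declare:

  token(int) TOK_NUMBER;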

The terminal functions are explained in the next sections.

3.2 dup/del

Terminals can have dup and del functions, just like nonterminals. See below for more information.
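
As a sketch (using the same fun syntax as the classify example in the next section), a terminal whose StringRef values require no copying or deallocation might declare:

  token(StringRef) TOK_NAME {
    fun dup(s) [ return s; ]
    fun del(s) [ /* nothing to deallocate */ ]
  }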

3.3 classify (advanced)

In some situations, it is convenient to be able to alter the classification of a token after it is yielded by the lexer but before the parser sees it, in particular before it is compared to lookahead sets. For this purpose, each time a token is yielded from the lexer, it is passed to that token's classify() function. classify accepts a single argument, the semantic value associated with that token. It returns the token's new classification, as a token id. (It cannot change the semantic value.)

The main way it differs from simply modifying the lexer is that the classify function has access to the parser context class, whereas the lexer presumably does not. In any case, it's something of a hack, and best used sparingly.

As a representative example, here is the classify function from c/c.gr, used to implement the lexer hack for a C parser:

  token(StringRef) L2_NAME {
    fun classify(s) [
      if (isType(s)) {
        return L2_TYPE_NAME;
      }
      else {
        return L2_VARIABLE_NAME;
      }
    ]
  }

4. Nonterminals

Following the terminals, the bulk of the grammar is one or more nonterminals. Each nonterminal declaration specifies all of the productions for which it is the left-hand-side.

A simple nonterminal might be:

  nonterm(int) Exp {
    -> e1:Exp "+" e2:Exp        { return e1 + e2; }
    -> n:TOK_NUMBER             { return n; }
  }

The type of the semantic value yielded by the productions is given in parentheses, after the keyword "nonterm". In this case, int is the type. The type can be omitted if productions do not yield interesting semantic values.

In the example, Exp has two productions, Exp -> Exp "+" Exp and Exp -> TOK_NUMBER. The "->" keyword introduces a production.

Right-hand-side symbols can be given names, by putting the name before a colon (":") and the symbol. These names can be used in the action functions to refer to the semantic values of the subtrees (like Bison's $1, $2, etc.). Note that action functions return their value, as opposed to (say) assigning to $$.

There are four kinds of nonterminal functions, described below.

4.1 dup

Because of the way the GLR algorithm operates, a semantic value yielded (returned) by one action may be passed as an argument to more than one action. This is in contrast to Bison, where each semantic value is yielded exactly once.

Depending on what the actions actually do, i.e. what the semantic values actually mean, the user may need to intervene to help manage the sharing of semantic values. For example, if the values form a tree where memory is managed by reference counting, then the reference count of a value would need to be increased each time it is yielded.

The dup() nonterminal function is intended to support the kind of sharing management alluded to above. Each time a semantic value is to be passed to an action, it first is passed to the associated dup() function. The value returned by dup() is stored back in the parser's data structures, for use the next time the value must be passed to an action. Effectively, by calling dup(), the parser is announcing, "I am about to surrender this value to an action; please give me a value to use in its place next time."

Common dup() strategies include simply returning the argument (when a garbage collector or another automatic scheme manages memory) and incrementing a reference count before returning the argument.
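
For instance, a sketch of the reference-counting approach, using the fun syntax from the classify example above (the Node type and its refcount field are hypothetical):

  nonterm(Node*) Exp {
    fun dup(n) [ n->refcount++; return n; ]

    -> n:TOK_NUMBER      { return new Node(n); }
  }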

4.2 del

A natural counterpart to dup(), del() accepts values that are not going to be passed to any more actions (this happens when, for example, one of the potential parsers fails to make further progress). It does not return anything.

Common del() strategies include doing nothing (when memory is managed automatically) and decrementing a reference count, deallocating the value once the count reaches zero.
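
A sketch of the corresponding reference-counting del() (again, Node and refcount are hypothetical):

  nonterm(Node*) Exp {
    fun del(n) [ if (--n->refcount == 0) { delete n; } ]

    -> n:TOK_NUMBER      { return new Node(n); }
  }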

4.3 merge

An ambiguity arises when a single sequence of tokens can be parsed as some nonterminal in more than one way. During parsing, when an ambiguity is encountered, the semantic values from the different parses are passed to the nonterminal's merge() function, two at a time.

Merge accepts two competing semantic value arguments, and returns a semantic value that will stand for the ambiguous region in all future reductions. Both the arguments and the return value have the type of the nonterminal's usual semantic values.

If there are more than two parses, the first two will be merged, then the result will be merged with the third, and so on until all have been merged. At each step, the first argument is the one that may have resulted from a previous merge(), and the second argument is not (unless it is the result of merging from further down in the parse forest).

Common merge() strategies include selecting one of the two values according to some disambiguation rule, or constructing a value that explicitly records both alternatives for later resolution.
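
A sketch of the simplest approach, which keeps the first of the two competing values (using the fun syntax from the classify example above; Node is hypothetical):

  nonterm(Node*) Exp {
    fun merge(a, b) [
      // arbitrarily prefer the first of the two competing parses
      return a;
    ]

    -> n:TOK_NUMBER      { return new Node(n); }
  }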

4.4 keep

Sometimes, a potential ambiguity can be prevented if a semantic value can be determined to be invalid in isolation (as opposed to waiting to see a competing alternative in merge()). To support such determination, each nonterminal can have a keep() function, which returns true if its semantic value argument should be retained (as usual) or false if its argument should be suppressed, as if the reduction never happened.

If keep returns false, the parser does not call del() on that value; it is regarded as disposed by keep.

Common keep() strategies include rejecting values that violate a constraint which can be checked from the value alone, without seeing the competing alternative.
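
A sketch (isValidInContext() is a hypothetical helper, for example a member of the context class):

  nonterm(Node*) Exp {
    fun keep(n) [ return isValidInContext(n); ]

    -> n:TOK_NUMBER      { return new Node(n); }
  }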

5. Options

A number of variations in parser generator behavior can be requested through the use of the option syntax:

  option option_name;

or, for options that accept an argument:

  option option_name option_argument;

The various options are described in the following sections.

5.1 useGCDefaults

The command

  option useGCDefaults;
instructs the parser generator to make the tacit assumption that sharing management is automatic (e.g. via a garbage collector), and hence set the default terminal and nonterminal functions appropriately.

In fact, most users of Elkhound will probably want to specify this option during initial grammar development, to reduce the amount of specification needed to get started. The rationale for not making useGCDefaults the global default is that users should be aware that the issue of sharing management is being swept under the carpet.

5.2 defaultMergeAborts

The command

  option defaultMergeAborts;
instructs the parser generator that if the grammar does not specify a merge() function, the supplied default should print a message and then abort the program. This is a good idea once it is believed that all the ambiguities have been handled by merge() functions.

5.3 Expected conflicts, unreachable symbols

Nominally, the parser generator expects there to be no shift/reduce conflicts, no reduce/reduce conflicts, and no symbols unreachable from the start symbol. Of course, the whole point of using GLR is to allow conflicts, but it is still generally profitable to keep track of how many conflicts are present at a given stage of grammar development, since a sudden explosion of conflicts often indicates a grammar bug.

So, the user can declare how many conflicts of each type are expected. For example,

  option shift_reduce_conflicts 40;
  option reduce_reduce_conflicts 30;
specifies that 40 shift/reduce conflicts and 30 reduce/reduce conflicts are expected. If the parser generator finds matching counts, it suppresses its usual report; if the counts differ, the discrepancy is reported.

Similarly, one can indicate the expected number of unreachable symbols (this usually corresponds to a grammar in development, where part of the grammar has been deliberately disabled by making it inaccessible):

  option unreachable_nonterminals 3;
  option unreachable_terminals 2;

5.4 lang_OCaml

By default, Elkhound generates a parser in C++. By specifying

  option lang_OCaml;
the user can request that the parser generator emit OCaml code instead. Please see ocaml/, probably starting with the example driver ocaml/main.ml.
