Elkhound Manual

This page describes the input format for grammars for the Elkhound parser generator, and the features of that generator. Another page describes the module structure of its implementation.

If you'd like to look at a simple grammar while reading this description, see examples/arith/arith.gr, a parser for simple arithmetic expressions.

1. Lexical structure

The grammar file format is free-form, meaning that all whitespace is considered equivalent. In the C tradition, grouping is generally denoted by enclosing things in braces ("{" and "}"). Strings are enclosed in double-quotes ("").

Grammar files may include other grammar files, by writing include("other_file_name").

Comments can use the C++ "//" syntax or the C "/**/" syntax.

2. Context Class

The parser's action functions are all members of a C++ context class. As the grammar author, you must define the context class. The class is introduced with the "context_class" keyword, followed by ordinary C++ class syntax (ending with "};").
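As a rough illustration of why the actions live in a class, here is a minimal sketch of what a context class for an arithmetic grammar might contain. The name ArithContext, its members, and the add() helper are all hypothetical; in a grammar file a class body like this would follow the "context_class" keyword, and any base class Elkhound may require is left out of the sketch.

```cpp
#include <string>
#include <vector>

// Hypothetical context class; not taken from the Elkhound sources.
// Any base class the generator may require is omitted here.
class ArithContext {
public:
  // State shared by every action function during one parse.
  int additions = 0;
  std::vector<std::string> warnings;

  // A helper that actions could call directly, since they are members
  // of the same class.
  int add(int a, int b) {
    ++additions;
    return a + b;
  }
};
```

Because every action is a member, an action body could call add(e1, e2) or append to warnings without any global variables.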

3. Terminals

The grammar must declare all of the tokens, also called terminals. A block of terminal declarations looks like:
  terminals {
    0 : TOK_EOF;
    1 : TOK_NUMBER;              // no alias
    2 : TOK_PLUS     "+";        // alias is "+" (including quotes)
    3 : TOK_MINUS    "-";
    4 : TOK_TIMES    "*";
    5 : TOK_DIVIDE   "/";
    6 : TOK_LPAREN   "(";
    7 : TOK_RPAREN   ")";
  }

Each declaration gives a unique numeric code (e.g. 3), a canonical name (e.g. TOK_MINUS), and an optional alias (e.g. "-"). Either the name or the alias may appear in the grammar productions, though the usual style is to use aliases for tokens that always have the same spelling (like "}"), and the name for others (like TOK_NUMBER).

Normally the tokens are expected to be described in their own file, from which make-token-files generates the token list seen above.
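One plausible form for such a separate token description is a C++ enumeration shared with the lexer. The sketch below simply mirrors the codes, names, and aliases from the terminals block above; the exact input format that make-token-files expects is not stated here, and the TokenCode type and tokenAlias() function are hypothetical.

```cpp
#include <string>

// Hypothetical lexer-side mirror of the terminals block. The numeric
// values must match the codes declared in the grammar file.
enum TokenCode {
  TOK_EOF    = 0,
  TOK_NUMBER = 1,
  TOK_PLUS   = 2,
  TOK_MINUS  = 3,
  TOK_TIMES  = 4,
  TOK_DIVIDE = 5,
  TOK_LPAREN = 6,
  TOK_RPAREN = 7
};

// Return the alias spelling (without the surrounding quotes) for a
// token code, or an empty string when the token has no alias.
inline std::string tokenAlias(TokenCode code) {
  switch (code) {
    case TOK_PLUS:   return "+";
    case TOK_MINUS:  return "-";
    case TOK_TIMES:  return "*";
    case TOK_DIVIDE: return "/";
    case TOK_LPAREN: return "(";
    case TOK_RPAREN: return ")";
    default:         return "";   // TOK_EOF and TOK_NUMBER have no alias
  }
}
```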

4. Nonterminals

Following the terminals, the bulk of the grammar is one or more nonterminals. Each nonterminal declaration specifies all of the productions for which it is the left-hand-side.

A simple nonterminal might be:

  nonterm(int) Exp {
    -> e1:Exp "+" e2:Exp        { return e1 + e2; }
    -> n:TOK_NUMBER             { return n; }
  }

The type of the semantic value yielded by the productions is given in parentheses, after the keyword "nonterm". In this case, int is the type. The type can be omitted if productions do not yield interesting semantic values.

In the example, Exp has two productions, Exp -> Exp "+" Exp and Exp -> TOK_NUMBER. The "->" keyword introduces a production.

Right-hand-side symbols can be given names, by putting the name before a colon (":") and the symbol. These names can be used in the action functions to refer to the semantic values of the subtrees (like Bison's $1, $2, etc.). Note that action functions return their value, as opposed to (say) assigning to $$.
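To make the naming and return conventions concrete, the two Exp actions can be pictured as ordinary member functions whose parameters are the named right-hand-side symbols. This is only a conceptual sketch: the class ExpActions and the function names expPlus and expNumber are hypothetical, and the code Elkhound actually generates may look different.

```cpp
// Hypothetical stand-in for the actions of the Exp nonterminal.
class ExpActions {
public:
  // Exp -> e1:Exp "+" e2:Exp     { return e1 + e2; }
  // The unnamed "+" contributes no parameter; e1 and e2 carry the
  // semantic values of the two Exp subtrees.
  int expPlus(int e1, int e2) { return e1 + e2; }

  // Exp -> n:TOK_NUMBER          { return n; }
  // The result is returned, rather than assigned to something like $$.
  int expNumber(int n) { return n; }
};
```

Under this picture, parsing the input 1 + 2 amounts to evaluating expPlus(expNumber(1), expNumber(2)).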

5. Nonterminal Functions

There are four kinds of nonterminal functions:

TODO: I need to describe these. For now, see the Tech Report.
