Elsa

Elsa: The Elkhound-based C/C++ Parser

Elsa is a C and C++ parser. It is based on the Elkhound parser generator. It lexes and parses input C/C++ code into an abstract syntax tree. It does some type checking, in the interest of elaborating the meaning of constructs, but it does not (yet?) reject all invalid programs.

To download Elkhound and Elsa, see the Elkhound distribution page.

High-level documentation:

design.html: Document explaining various aspects of the internal design of Elsa.
tutorial.html: Introduction to using and modifying Elsa.
cc.ast.html: The C/C++ abstract syntax tree created by the parser.
cc_type.html: The type representation objects created by the type checker.
cpp_er.html: C++ Entities and Relationships. Provides an overview of C++ static semantics.

Low-level documentation:

serialization.txt: Explains the XML serialization architecture, design decisions, how to use it, etc.
declarator.html: Some details about how declarators are parsed.
convertibility.txt: A discussion of the standard-convertibility relation, and its application to operator overload resolution.
lookup.txt: Documents some of my interpretations of the lookup rules specified in the C++ standard, and how they are implemented in Elsa.
complex.txt: Brief overview of the degree to which GNU/C99 complex/imaginary types are handled in Elsa.
permissive.txt: Explanation of Elsa's "permissive" mode, which is useful during automatic minimization.
coloncolon.txt: Documents how an ambiguity relating to the "::" operator is handled in cc.gr.

Elsa requires the following external software:

elkhound, a GLR parser generator.
ast, a system for making abstract syntax trees.
smbase, a utility library.
Flex, a lexical analyzer generator.

Build instructions:

  $ ./configure
  $ make
  $ make check

./configure understands these options. You can also look at the Makefile.

Parsing some sample (already preprocessed) input:

  $ ./ccparse in/t0001.cc

The above command will parse and type check the given file. To make it print the annotated, post-type-check AST, say

  $ ./ccparse -tr printTypedAST in/t0001.cc

Additional -tr flags of interest:

printAST: Print the (possibly ambiguous) AST before type checking.
printTypedAST: Print the AST after type checking.
env: Print environment modifications as they happen.
disamb: Print disambiguation activity.
printHierarchies: Print inheritance hierarchies in Dot format. Interesting in that virtual inheritance is represented properly; for example in/std/3.4.5.cc yields 3.4.5.png.
mustBeUnambiguous: After type checking, scan the AST to verify there are no remaining ambiguities. If there are, abort.
prettyPrint: Print out the AST as C++. This is still somewhat incomplete.

The -tr flags can be passed separately, or strung together separated by commas (e.g. "-tr env,disamb,printAST").

Module List:

ast_build.h, ast_build.cc: Some utilities for constructing fragments of the C++ AST.
baselexer.h, baselexer.cc: Intermediate Lexer abstraction, built on top of yyFlexLexer and implementing LexerInferface (thus fitting between flex and Elkhound), but not specific to any set of tokens. Lexer (lexer.h) builds on top of this.
builtinops.h, builtinops.cc: Representation of built-in operators, for use during operator overload resolution.
cc.ast: C/C++ Abstract Syntax Tree. This is the most important file in the parser, since it defines the interface between the parser and everything else that comes after it. It is documented separately in cc.ast.html.
cc.gr: C/C++ parsing grammar. This is the second-most important file, as it tells Elkhound how to parse the token stream. This grammar is based on that in the C++ Standard document, but then modified to remove unnecessary ambiguities and improve the grammar's ability to extract structure.
cc_ast_aux.cc: Some auxilliary functions for cc.ast.
cc_elaborate.ast, cc_elaborate.h, cc_elaborate.cc: This module finds implicit function calls (like constructors) and creates an explicit representation of them. An analysis can then ignore implicit calls and just use the constructed explicit AST.
cc_env.h, cc_env.cc: Env, the type checking environment. Fundamentally just a stack of Scopes (cc_scope.h), plus some global type checking state.
cc_err.h, cc_err.cc: ErrorMsg, an object for representing type checking errors. For now it's just an error string plus some metadata (like source location), but I plan to evolve it to include more structured data like pointers to (instead of just string representations of) the types involved in the error.
cc_flags.h, cc_flags.cc: This module defines a variety of enums relevant to parsing and type checking C++, including enums for all the built-in types, operators, etc.
cc_lang.h, cc_lang.cc: CCLang, a package of language dialect options. Setting flags in this class tells the lexer, parser and type checker what language options to support (e.g. C vs. C++).
cc_print.ast, cc_print.h, cc_print.cc: cc_print is a module to pretty-print the AST using C++ syntax. It extends the AST with entry points for printing.
cc_scope.h, cc_scope.cc: A Scope is two maps: variables and types. The environment (cc_env.h) consists of a stack of them.
cc_tcheck.ast, cc_tcheck.cc: This is the type checker. It consists of an AST extension to add type checking entry points and annotations, and an implementation of all of those type checking functions. It's the most complicated part of the parser.
cc_tokens.tok: This file lists all of the kinds of tokens the lexer recognizes. It's designed to be extended simply by appending. The script make-token-files takes this as input, and generates cc_tokens.h, cc_tokens.cc and cc_tokens.ids. This last file is then included into cc.gr (the others participate in compilation in the obvious way).
cc_type.h, cc_type.cc: This module defines the representation of types. They form the core of the data manipulated by the type checker. They are documented separately in cc_type.html.
ccparse.h, ccparse.cc: This module defines part of the parser context class, and assists minimally with parsing.
cfg.ast, cfg.h, cfg.cc: This is type-checking extension that computes a statement-level control flow graph for each function.
const_eval.h, const_eval.cc: Constant-expression evaluator. Tries to predict the effect of coercing data among different representation sizes, among other things.
generic_amb.h: This is the generic ambiguity resolution procedure. It typechecks all of the alternatives, and selects the one that passes. Note that there are other ambiguity resolution procedures in use, but this is the one used in the absence of a specialized procedure.
generic_aux.h: Some routines for printing and modifying AST nodes that have ambiguity pointers.
gnu.lex, gnu_ext.tok, gnu.gr, gnu.ast, gnu.cc: These files comprise the "gnu" extension module, though in truth this contains extensions for both gcc and C99. See gnu.gr for a complete list of the extensions implemented.
implconv.h, implconv.cc: This module represents and computes implicit conversions, as defined in sections 13.3.3.1 and 13.3.3.2 of the C++ standard.
implint.h, implint.cc: Support routines, including ambiguity resolution, for the implicit-int K&R extension.
kandr.gr, kandr.ast, kandr.cc: K&R extensions, in particular K&R function definitions and the implicit-int rule. Daniel Wilkerson implemented most of this.
cc.lex, lexer.h, lexer.cc: This module chops up a given C++ source file into tokens. It does not do any preprocessing, so one must use an external preprocessor first.
lookupset.h, lookupset.cc: Class to store the result set of a lookup.
main.cc: This module contains the main() function of the parser. It's a simple driver around the other modules. The nominal intent is that people who want to use parts of Elsa in their own projects users will copy and modify this file as necessary.
mangle.h, mangle.cc: This is a very rudmentary name mangler. It is a somewhat arbitrary injective map from Types to character strings, for use by the Oink linker imitator (identifying declarations of the same entity from different translation units). It does not implement any standard mangling scheme.
matchtype.h, matchtype.cc: Type matching in the presence of type variables corresponding to template parameters; sort of a generalized Type::equals.
overload.h, overload.cc: Does overload resolution of a given candidate set.
parssppt.h, parssppt.cc: This is a poorly-designed module intended to abstract some of the functionality otherwise common to main()-providing modules. It needs to die. alt.parssppt.die.die.die.
semgrep.cc: Sample application of Elsa, a "semantic grep". This is part of the tutorial.
serialno.h, serialno.cc: This is a simple module that can be used to attach object creation serial numbers when an appropriate compile-time switch is used. This is sometimes more convenient than working with virtual addresses, while debugging.
sprint.h, sprint.cc: "Structure printer"; work in progress.
stdconv.h, stdconv.cc: Represents and computes standard conversions, as defined in section 4 of the C++ standard. See also convertibility.txt.
strmap.h: Hashtable-based map from StringRef to some pointer.
template.h, template.cc: Data structures and algorithms for the template instantiation implementation.
tlexer.cc: Simple test driver program for the lexer.
typelistiter.h, typelistiter.cc: Generic interface, plus a couple of implementations, for iterating over sequences and examining their stored types.
variable.h, variable.cc: Variable, a class for holding information about names in the "variable" namespace. See variable.h for a list of the kinds of things that get represented with Variables. This module is closely related to cc_type.

Module dependency diagram:
Module dependencies
Or, in Postscript.

Miscellanous files:

chop_out: This script extracts pretty-printed C++ syntax from the other debugging output produced by ccparse.
extradep.mk: Build-time dependencies among auto-generated source files. Produced by elkhound/find-extra-deps.
idemcheck: Script to verify that parsing then pretty-printing is idempotent.
in: Directory with testcases.
include: When preprocessing, add this directory to the preprocessor's search path. It contains compiler-specific headers. Generally I just use gcc's headers, but some of gcc's headers use syntax that Elsa doesn't (yet?) understand, so this directory contains my replacements.
merge-lexer-exts.pl: Merge a base flex lexer with one or more extensions.
multitest.pl: Used by the regression tester to test a given input file, plus several variations obtained by un-commenting certain lines.
regrtest: Regression tests.
run-delta-loop: Minimize tmp.i exhibiting some specified error message.
test-for-error: Test for exhibition of a particular error; used by run-delta-loop.
test-parse: Script to parse a file, making sure the parse is unambiguous.
test-parse-buildlog: This is a script that interprets the output of 'make' in order to find C++ inputs to test with Elsa. I use it to make claims like "Elsa can parse Mozilla".