Elsa: The Elkhound-based C/C++ Parser
Elsa is a C and C++ parser. It is based on the Elkhound parser generator. It lexes
and parses input C/C++ code into an abstract syntax tree. It does
some type checking, in the interest of elaborating the meaning of
constructs, but it does not (yet?) reject all invalid programs.
To download Elkhound and Elsa, see the
Elkhound distribution page.
High-level documentation:
- design.html: Document explaining various
aspects of the internal design of Elsa.
- tutorial.html: Introduction to using and
modifying Elsa.
- cc.ast.html: The C/C++ abstract syntax tree
created by the parser.
- cc_type.html: The type representation
objects created by the type checker.
- cpp_er.html: C++ Entities and Relationships.
Provides an overview of C++ static semantics.
Low-level documentation:
- serialization.txt: Explains the
XML serialization architecture, design decisions, how to use it, etc.
- declarator.html: Some details about how
declarators are parsed.
- convertibility.txt: A discussion
of the standard-convertibility relation, and its application to
operator overload resolution.
- lookup.txt: Documents some of my
interpretations of the lookup rules specified in the C++ standard,
and how they are implemented in Elsa.
- complex.txt: Brief overview of the
degree to which GNU/C99 complex/imaginary types are handled in Elsa.
- permissive.txt: Explanation of Elsa's
"permissive" mode, which is useful during automatic minimization.
- coloncolon.txt: Documents how an
ambiguity relating to the "::" operator is handled in cc.gr.
Elsa requires the following external software:
- elkhound, a GLR parser generator.
- ast, a system for making abstract syntax trees.
- smbase, a utility library.
- Flex,
a lexical analyzer generator.
Build instructions:
$ ./configure
$ make
$ make check
./configure understands
these options. You can also
look at the Makefile.
Parsing some sample (already preprocessed) input:
$ ./ccparse in/t0001.cc
The above command will parse and type check the given file. To
make it print the annotated, post-type-check AST, say
$ ./ccparse -tr printTypedAST in/t0001.cc
Additional -tr flags of interest:
- printAST: Print the (possibly ambiguous) AST before type checking.
- printTypedAST: Print the AST after type checking.
- env: Print environment modifications as they happen.
- disamb: Print disambiguation activity.
- printHierarchies: Print inheritance hierarchies in
Dot format.
Virtual inheritance is represented properly;
for example, in/std/3.4.5.cc yields
3.4.5.png.
- mustBeUnambiguous: After type checking, scan the AST to verify there
are no remaining ambiguities. If there are, abort.
- prettyPrint: Print out the AST as C++. This is still somewhat incomplete.
The -tr flags can be passed separately, or strung together
separated by commas (e.g. "-tr env,disamb,printAST").
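As a rough illustration of the comma-separated form, the following C++ sketch splits a -tr argument into individual flag names. The function name splitTraceFlags is hypothetical, and this is not the code Elsa actually uses to process -tr; it only shows the idea, assuming empty pieces are ignored.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of Elsa): split a -tr argument such as
// "env,disamb,printAST" into its individual flag names.
std::vector<std::string> splitTraceFlags(const std::string &arg) {
    std::vector<std::string> flags;
    std::istringstream in(arg);
    std::string flag;
    while (std::getline(in, flag, ',')) {
        if (!flag.empty()) {        // assumption: skip empty pieces like "env,,disamb"
            flags.push_back(flag);
        }
    }
    return flags;
}
```

Under this sketch, passing "-tr env,disamb,printAST" would be treated the same as passing the three flags separately.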
Module List:
- ast_build.h,
ast_build.cc:
Some utilities for constructing fragments of the C++ AST.
- baselexer.h,
baselexer.cc:
Intermediate Lexer abstraction, built on top of yyFlexLexer and implementing
LexerInterface (thus fitting between flex and Elkhound), but not specific to
any set of tokens. Lexer (lexer.h) builds on top of this.
- builtinops.h,
builtinops.cc:
Representation of built-in operators, for use during operator
overload resolution.
- cc.ast:
C/C++ Abstract Syntax Tree. This is the most important
file in the parser, since it defines the interface between
the parser and everything else that comes after it. It is
documented separately in cc.ast.html.
- cc.gr:
C/C++ parsing grammar. This is the second-most important file,
as it tells Elkhound how to parse the token stream. This grammar
is based on that in the C++ Standard document, but then modified
to remove unnecessary ambiguities and improve the grammar's ability
to extract structure.
- cc_ast_aux.cc:
Some auxiliary functions for cc.ast.
- cc_elaborate.ast,
cc_elaborate.h,
cc_elaborate.cc:
This module finds implicit function calls (like constructors) and creates
an explicit representation of them. An analysis can then ignore implicit
calls and just use the constructed explicit AST.
- cc_env.h,
cc_env.cc:
Env, the type checking environment. Fundamentally just a stack of
Scopes (cc_scope.h), plus some global
type checking state.
- cc_err.h,
cc_err.cc:
ErrorMsg, an object for representing type checking errors. For now
it's just an error string plus some metadata (like source location),
but I plan to evolve it to include more structured data like pointers
to (instead of just string representations of) the types involved in
the error.
- cc_flags.h,
cc_flags.cc:
This module defines a variety of enums relevant to parsing and
type checking C++, including enums for all the built-in types,
operators, etc.
- cc_lang.h,
cc_lang.cc:
CCLang, a package of language dialect options. Setting flags in
this class tells the lexer, parser and type checker what language
options to support (e.g. C vs. C++).
- cc_print.ast,
cc_print.h,
cc_print.cc:
cc_print is a module to pretty-print the AST using C++ syntax. It
extends the AST with entry points for printing.
- cc_scope.h,
cc_scope.cc:
A Scope is two maps: variables and types. The environment (cc_env.h) consists of a stack of them.
- cc_tcheck.ast,
cc_tcheck.cc:
This is the type checker. It consists of an AST extension to
add type checking entry points and annotations, and an implementation
of all of those type checking functions. It's the most complicated
part of the parser.
- cc_tokens.tok:
This file lists all of the kinds of tokens the lexer recognizes. It's
designed to be extended simply by appending. The script
make-token-files
takes this as input, and generates
cc_tokens.h,
cc_tokens.cc and
cc_tokens.ids. This last file is then
included into cc.gr (the others participate in
compilation in the obvious way).
- cc_type.h,
cc_type.cc:
This module defines the representation of types. They
form the core of the data manipulated by the type checker.
They are documented separately in
cc_type.html.
- ccparse.h,
ccparse.cc:
This module defines part of the parser context class, and assists
minimally with parsing.
- cfg.ast,
cfg.h,
cfg.cc:
This is a type-checking extension that computes a statement-level
control flow graph for each function.
- const_eval.h,
const_eval.cc:
Constant-expression evaluator. Tries to predict the effect of
coercing data among different representation sizes, among other things.
- generic_amb.h:
This is the generic ambiguity resolution procedure. It typechecks
all of the alternatives, and selects the one that passes. Note that
there are other ambiguity resolution procedures in use, but this is
the one used in the absence of a specialized procedure.
- generic_aux.h:
Some routines for printing and modifying AST nodes that have
ambiguity pointers.
- gnu.lex,
gnu_ext.tok,
gnu.gr,
gnu.ast,
gnu.cc:
These files comprise the "gnu" extension module, though in truth this contains
extensions for both gcc and C99. See gnu.gr for a complete
list of the extensions implemented.
- implconv.h,
implconv.cc:
This module represents and computes implicit conversions, as defined
in sections 13.3.3.1 and 13.3.3.2 of the C++ standard.
- implint.h,
implint.cc:
Support routines, including ambiguity resolution, for the implicit-int
K&R extension.
- kandr.gr,
kandr.ast,
kandr.cc:
K&R extensions, in particular K&R function definitions and the
implicit-int rule. Daniel Wilkerson implemented most of this.
- cc.lex,
lexer.h,
lexer.cc:
This module chops up a given C++ source file into tokens. It does
not do any preprocessing, so one must use an external preprocessor
first.
- lookupset.h,
lookupset.cc:
Class to store the result set of a lookup.
- main.cc:
This module contains the main() function of the parser. It's a simple
driver around the other modules. The nominal intent is that people who
want to use parts of Elsa in their own projects will copy and modify
this file as necessary.
- mangle.h,
mangle.cc:
This is a very rudimentary name mangler. It is a somewhat arbitrary injective
map from Types to character strings, for use by the Oink linker imitator
(identifying declarations of the same entity from different translation units).
It does not implement any standard mangling scheme.
- matchtype.h,
matchtype.cc:
Type matching in the presence of type variables corresponding to template
parameters; sort of a generalized Type::equals.
- overload.h,
overload.cc:
Does overload resolution of a given candidate set.
- parssppt.h,
parssppt.cc:
This is a poorly-designed module intended to abstract some of the
functionality otherwise common to main()-providing modules. It
needs to die. alt.parssppt.die.die.die.
- semgrep.cc:
Sample application of Elsa, a "semantic grep". This is part
of the tutorial.
- serialno.h,
serialno.cc:
This is a simple module that can be used to attach object creation
serial numbers when an appropriate compile-time switch is used. This
is sometimes more convenient than working with virtual addresses,
while debugging.
- sprint.h,
sprint.cc:
"Structure printer"; work in progress.
- stdconv.h,
stdconv.cc:
Represents and computes standard conversions, as defined in section
4 of the C++ standard. See also
convertibility.txt.
- strmap.h:
Hashtable-based map from StringRef to some pointer.
- template.h,
template.cc:
Data structures and algorithms for the template instantiation implementation.
- tlexer.cc:
Simple test driver program for the lexer.
- typelistiter.h,
typelistiter.cc:
Generic interface, plus a couple of implementations, for iterating
over sequences and examining their stored types.
- variable.h,
variable.cc:
Variable, a class for holding information about names in the
"variable" namespace. See
variable.h for a list of the kinds
of things that get represented with Variables. This module
is closely related to cc_type.
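The cc_env and cc_scope entries in the list above describe the environment as a stack of Scopes, each holding one map for variables and one for types. The C++ sketch below illustrates that shape only. The names SketchScope and SketchEnv are hypothetical, the mapped values are plain strings rather than Elsa's Variable and Type objects, and the lookup shown (innermost scope first) is a simplification; the actual lookup rules are far more involved (see lookup.txt).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a Scope: one map for variables, one for types.
// Only the variable map is exercised below; the type map would be used the same way.
struct SketchScope {
    std::map<std::string, std::string> variables;
    std::map<std::string, std::string> types;
};

// Hypothetical stand-in for Env: a stack of scopes, innermost last.
class SketchEnv {
public:
    SketchEnv() { enterScope(); }              // start with a global scope

    void enterScope() { scopes.emplace_back(); }

    void exitScope() {
        assert(scopes.size() > 1);             // never pop the global scope
        scopes.pop_back();
    }

    // Declare a variable in the innermost scope.
    void addVariable(const std::string &name, const std::string &type) {
        scopes.back().variables[name] = type;
    }

    // Search from the innermost scope outward. On success, store the
    // variable's type in typeOut and return true; otherwise return false.
    bool lookupVariable(const std::string &name, std::string &typeOut) const {
        for (auto it = scopes.rbegin(); it != scopes.rend(); ++it) {
            auto found = it->variables.find(name);
            if (found != it->variables.end()) {
                typeOut = found->second;
                return true;
            }
        }
        return false;
    }

private:
    std::vector<SketchScope> scopes;
};
```

In this sketch, a declaration in an inner scope shadows an outer one with the same name until the inner scope is exited.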
Module dependency diagram: [image; also available in PostScript].
Miscellaneous files:
- chop_out:
This script extracts pretty-printed C++ syntax from the other
debugging output produced by ccparse.
- extradep.mk:
Build-time dependencies among auto-generated source files.
Produced by
elkhound/find-extra-deps.
- idemcheck:
Script to verify that parsing then pretty-printing is idempotent.
- in:
Directory with testcases.
- include:
When preprocessing, add this directory to the preprocessor's
search path. It contains compiler-specific headers. Generally
I just use gcc's headers, but some of gcc's headers use syntax
that Elsa doesn't (yet?) understand, so this directory contains
my replacements.
- merge-lexer-exts.pl:
Merge a base flex lexer with one or more extensions.
- multitest.pl:
Used by the regression tester to test a given input file, plus
several variations obtained by un-commenting certain lines.
- regrtest:
Regression tests.
- run-delta-loop:
Minimize tmp.i exhibiting some specified error message.
- test-for-error:
Test for exhibition of a particular error; used by run-delta-loop.
- test-parse:
Script to parse a file, making sure the parse is unambiguous.
- test-parse-buildlog:
This is a script that interprets the output of 'make' in order to
find C++ inputs to test with Elsa. I use it to make claims like
"Elsa can parse Mozilla".
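The property that idemcheck verifies (parsing then pretty-printing is idempotent) can be stated as: applying the transformation a second time changes nothing. The C++ sketch below shows that check in isolation. It is not the idemcheck script itself; isIdempotentOn and normalizeWhitespace are hypothetical names, and the whitespace normalizer merely stands in for "run ccparse with prettyPrint", which a self-contained example cannot do.

```cpp
#include <cassert>
#include <cctype>
#include <functional>
#include <string>

// Stand-in for "parse, then pretty-print": collapse each run of whitespace
// into a single space and drop leading/trailing whitespace.
std::string normalizeWhitespace(const std::string &text) {
    std::string out;
    bool pendingSpace = false;
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            pendingSpace = !out.empty();   // leading whitespace is dropped
        } else {
            if (pendingSpace) {
                out += ' ';
            }
            pendingSpace = false;
            out += static_cast<char>(c);
        }
    }
    return out;                            // trailing whitespace is never emitted
}

// True if applying `transform` to its own output yields the same output.
bool isIdempotentOn(const std::function<std::string(const std::string &)> &transform,
                    const std::string &input) {
    const std::string once = transform(input);
    const std::string twice = transform(once);
    return once == twice;
}
```

A transformation that keeps changing its own output, such as one that appends a character on every pass, fails this check.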