cc.ast: The Abstract Syntax Tree description

This page documents cc.ast, the most important file in Elsa. Mainly I document here the big ideas or the tricky points; cc.ast itself should be consulted for the details, and you should probably be looking at that file as you read this page (classes are documented in the same order as they appear in cc.ast).

Note that some of the AST classes, but not all, have source location information (SourceLoc; smbase/srcloc.h). Generally I've just put location info wherever I've happened to need it; there's no clear strategy. Now that SourceLoc is just one word (it used to be three), I might put it everywhere.

ASTVisitor

One way to walk over the AST is to write a set of mutually-recursive functions that manually traverse the AST edges. This works well for analyses that are sensitive to context, and/or use results from subtrees in nontrivial ways. The type checker (cc_tcheck.ast, cc_tcheck.cc) is such an analysis.

An alternative, which is less work if the analysis is mostly insensitive to context and/or does not need to inspect all of the nodes in the AST, is to use ASTVisitor. One simply implements the ASTVisitor interface (defined in cc.ast.gen.h once that file has been generated), and invokes the traverse method on the root of the AST subtree of interest. For example, this code would print all of the function names (without qualifiers) in the AST:

  class FuncPrinter : public ASTVisitor {
  public:
    virtual bool visitFunction(Function *f) {
      cout << f->nameAndParams->getDeclaratorId()->getName() << "\n";
    }
  };

  void printFuncNames(TranslationUnit *unit)
  {
    FuncPrinter fp;
    unit->traverse(fp);
  }

TranslationUnit, TopForm

The entire AST is a list of TopForms, collected into a TranslationUnit. A TopForm is something that appears at toplevel, or at toplevel in a namespace.

Function, MemberInit

All function definitions, whether at toplevel or inside class bodies, get a Function AST node. The function's name and parameters are encoded in the nameAndParams Declarator. Also included are constructor member initializers, and constructor exception handlers, if any.

Declaration

A Declaration is a TypeSpecifier and then some Declarators, plus some optional keywords (DeclFlags) like static or extern.

ASTTypeId

An ASTTypeId is like a Declaration, but there's only one Declarator and no DeclFlags. It's used for function parameters and for the types that appear in the cast syntax.

PQName

A PQName, a possibly-qualified name, is usually just a string (actually a StringRef, a pointer into the StringTable; see ast/strtable.h). However, it might also have qualifiers, those names that can appear before the "::" symbol. To complicate matters further, sometimes the names have template arguments, and sometimes the names refer to operators.

TypeSpecifier, BaseClassSpec, Enumerator

TypeSpecifiers are the first part of Declarations. They mainly correspond to AtomicTypes in the terminology of cc_type: built-in types, enums, structs, classes, and unions. However, via typedefs (TS_name), they can actually refer to constructed types (like pointers) also.

MemberList, Member

Members are elements in a class definition. Typical members are data members or method prototypes (MR_decl), or inline definitions of methods (MR_func). MR_publish is obscure, corresponding to an "access declaration" [cppstd Section 11.3].

Declarator, IDeclarator, ExceptionSpec

The C/C++ syntax for declarators is probably the strangest part of the language to someone new to parsing it. Declarators are the things that come after TypeSpecifiers in Declarations:

  int                  *       x      ,    *     *     y     ;
   |
  TypeSpecifier     <-- Declarator ->    <-- Declarator -->

  <---------------------- Declaration ----------------------->
But they also have a recursive structure, represented in my AST as IDeclarators:
  int       *          *                 y       ;
                                         |
                                      D_name
                       <---- D_pointer ---->
            <--------- D_pointer ---------->


  int    * *      (    *           func    ) (int, int)  ;
                                     |
                                  D_name
                       <-- D_pointer -->
                  <----- D_grouping ------->
                  <-------------- D_func ------------->
           <------------- D_pointer ------------------>
         <------------- D_pointer -------------------->

Now, what really makes them screwy is that they're inside out! Taking the last example above, func is being declared to be a pointer to a function which returns a pointer to pointer to an integer--you read it from right to left. The type checker (cc_tcheck.cc) sorts the types out into a more usable representation (cc_type), but they start as above.

(By the way: declarators are inside-out not because Kernighan and Ritchie are evil or stupid, but because they wanted the syntax of declarations to mirror the syntax of expressions, to try to make the language easier to learn. When I first learned C, I thought it made perfect sense. Only now as a PhD student in computer science do I find it confusing. :)  )

OperatorName

OperatorName just stores the various operator-induced names. The getOperatorName then flattens them down to a string anyway. The tricky one is ON_conversion, which can't be canonically flattened, so this has to be special-cased in code that consumes OperatorNames.

Statement, Condition, Handler

Statement represents statements; it's pretty straightforward. Condition is the thing between the parentheses in conditionals. Handler is what comes after try.

Expression, ExpressionListOpt

Expression represents expressions. Again, straightforward.

Notice that E_stringLit may contain a continuation. That's how string literal concatenation is implemented: in the parser. This is done because the lexer does no interpretation of any kind of literal, but concatenation semantics would require that it do so (e.g. "\xA" "B" is equivalent to "\x0AB", two characters, not "\xAB", one character).

One possibly surprising choice is not to fold E_this into E_variable. The reason E_this is split off is that "this" is a pointer to the receiver object (called "__receiver"), which already is represented by a Variable in the parameter list. If "this" were parsed as an E_variable then its 'var' field would have to point at some other Variable (so the type makes sense), but then we'd have two Variables for the same concept (receiver), which would be extra disconnect for analyses to overcome. Therefore E_this lets us treat "this" the same as "&__receiver".

Initializer

Initializers are what come after IDeclarators in Declarations; they're the "3" in "int x = 3;". They can be nested, as IN_compound.

GNU/C99 designated initializers are implemented in the GNU extension, gnu.ast. The build process that comes with Elsa will include the GNU extension if you pass "-gnu" to ./configure.

TemplateDeclaration, TemplateParameter, TemplateArgument

Representation of templates in the AST is straightforward. Elsa does not (yet?) expand or instantiate templates, however.

2005-03-02: The above is no longer true. See design.html for more info.

Valid HTML 4.01!