Part 2: Modifying Elsa

You presumably just came from Part 1, which introduces the Elsa build process and ccparse tool. In Part 2, you will modify Elsa's source code to achieve various effects, and to familiarize yourself with a few of its key data structures.

4. Count the casts

One of the complaints people typically have about C is that one of its most unsafe features, type casts, has just about the hardest syntax to grep for: a parenthesized type, juxtaposed with an expression. But with a parser like Elsa, casts are easy to find, so we will write a simple AST visitor to do just this.

Open main.cc (in elsa/), and near the top (say, after the block of #includes but before DeclTypeChecker), add the following:

  class CastCounter : public ASTVisitor {
  public:      // data
    int count;

  public:      // funcs
    CastCounter() : count(0) {}
    virtual bool visitExpression(Expression *obj);
  };

  bool CastCounter::visitExpression(Expression *obj)
  {
    if (obj->isE_cast()) {
      count++;         // found a cast!
    }
    return true;       // visit my children too
  }

This declares class CastCounter, which inherits from ASTVisitor. ASTVisitor contains a bunch of virtual methods, one for each kind of AST node (superclass, not subclass). By default these methods do nothing, but we can override that behavior in subclasses like CastCounter.

In CastCounter, we have a count member to keep track of how many casts are found. We also override visitExpression, which will be called whenever the visitor encounters an Expression AST node. The implementation of this method simply increments count whenever the expression is an E_cast. (The rest of the AST is defined in cc.ast and documented in cc.ast.html.)

Now, go to the bottom of the doit() function, and just before the call to strTable.clear(), add the following code:

  CastCounter cc;
  unit->traverse(cc);
  cout << inputFname << ": found " << cc.count << " casts\n";

This code creates an instance of CastCounter, initiates an AST traversal with traverse(), and finally prints out the count.

Rebuild ccparse with these changes:

  elsa$ make

Now, run it on some input:

  elsa$ ./ccparse in/t0001.cc
  %%% progress: 0ms: done parsing (1 ms, 0_249063 cycles)
  ambiguous nodes: 0
  %%% progress: 4ms: done type checking (3 ms)
  typechecking results:
    errors:   0
    warnings: 0
  %%% progress: 6ms: done elaborating (0 ms)
  in/t0001.cc: found 0 casts

Well that wasn't too exciting, there aren't any casts in in/t0001.cc. Here are some other files to try:

t0001.cc: 0 casts
t0004.cc: 1 cast
t0014.cc: 1 function-style cast, plus one each of the keyword casts
t0059.cc: 3 casts
t0117.cc: 184 casts

One interesting aspect of in/t0014.cc is that the "cast" on line 11 is actually parsed as an E_constructor (constructor call), but because the type it constructs is not a class, the type checker replaces it with an E_cast. This makes the meaning of the code more obvious to an analysis: there is no function call there, despite the syntax suggesting it.

If you want, you can verify the absence of the E_cast by moving the cast-counting code up to between the parser and type checking phases in doit().

5. Print the casts

Let's try something a little more ambitious: instead of just counting the casts, we will print for each cast its source location, the source expression, the destination type, and which C++ cast keyword it uses (if any).

To accomplish all of this, replace the existing visitExpression implementation with the following:

  // find the location of a type-id; as locations are not stored on
  // all AST nodes, we must dig a little to get it
  inline SourceLoc locOfType(ASTTypeId *id)
  {
    return id->decl->decl->loc;
  }

  bool CastCounter::visitExpression(Expression *obj)
  {
    if (obj->isE_cast()) {
      count++;
      E_cast *cast = obj->asE_cast();

      cout << toString(locOfType(cast->ctype))
           << ": C cast of `" << cast->expr->exprToString()
           << "' to type `" << cast->ctype->getType()->toString() << "'\n";
    }

    else if (obj->isE_keywordCast()) {
      count++;
      E_keywordCast *cast = obj->asE_keywordCast();

      cout << toString(locOfType(cast->ctype))
           << ": " << toString(cast->key)
           << " of `" << cast->expr->exprToString()
           << "' to type `" << cast->ctype->getType()->toString() << "'\n";
    }

    return true;       // visit my children too
  }

Source locations have type SourceLoc, which is defined in srcloc.h. Not all AST nodes have locations, because I was concerned about wasting space. As in this example, there's usually a location "nearby" that will suffice. The locOfType function goes and gets it.

The printing code uses the toString(SourceLoc) function, the toString(CastKeyword) function, the Expression::exprToString() method, and the Type::toString method to print the AST components. Most things in Elsa can be printed with such routines.

Since keyword casts are their own AST node, there are two conditional blocks. Each uses a "downcast" method such as asE_cast() or asE_keywordCast(), only after checking the type with a type interrogation method such as isE_cast() or isE_keywordCast(). If you call a downcast method and the object is not of the appropriate type, an assertion will fail. (You can also use the C++ RTTI mechanisms like dynamic_cast to do this; I just prefer to do it my way.)

Build and run this program:

  elsa$ make
  [...]
  elsa$ ./ccparse in/t0014.cc
  %%% progress: 0ms: done parsing (1 ms, 1_080294 cycles)
  ambiguous nodes: 2
  %%% progress: 6ms: done type checking (5 ms)
  typechecking results:
    errors:   0
    warnings: 0
  %%% progress: 9ms: done elaborating (1 ms)
  in/t0014.cc:11:7: C cast of ` (6)' to type `int'
  in/t0014.cc:28:21: const_cast of ` (x)' to type `int'
  in/t0014.cc:29:23: dynamic_cast of ` (x)' to type `int'
  in/t0014.cc:30:22: static_cast of ` (x)' to type `int'
  in/t0014.cc:31:27: reinterpret_cast of ` (x)' to type `int'
  in/t0014.cc: found 5 casts

One annoyance with the output above is the leading space that gets printed before the expressions. That comes from the pretty printer module, cc_print (cc_print.ast, cc_print.h and cc_print.cc), because it is trying to avoid problems with printing things too close together. Eventually, the pretty printer will be fixed to not do that. A simple short-term fix, if you're inclined, is to pass the expression text through trimWhitespace() like this (be sure to #include strutil.h):

  << ": C cast of `" << trimWhitespace(cast->expr->exprToString())

6. Semantic grep

A frequent programmer's task is to find all uses of some variable or function. People often use the Unix grep utility for this task, but grep is easily confused by entities that have the same name (but are declared in different scopes). Using Elsa, we can write a tool that will report all uses of a given variable.

Rather than have you type in the code for the grepper, it is a part of the distribution tarball, in semgrep.cc. The targets for building it are already in the Makefile, so let's build and use it right away:

  elsa$ make semgrep
  [...]
  elsa$$ ./semgrep f 6 in/t0005.cc
  in/t0005.cc:6:6: declaration
  in/t0005.cc:13:4: use as variable
  in/t0005.cc:15:3: use as variable

I've chosen to denote variables by their name and the line on which they are declared or defined. This is both reasonably convenient and unambiguous. The usage is then

  ./semgrep <name> <line> input.cc

Try a few more examples.

Now let's look at how semgrep.cc works. Near the top is GrepVisitor, the visitor that looks for uses of a variable. It visits E_variable and E_fieldAcc nodes to find uses in expressions. It also looks in Declarators, to try to find the definition (of, say, a function).

Both kinds of expressions are annotated by the type checker with a field "var", which points at the Variable object that represents the variable referred-to. The var field is declared in cc_tcheck.ast, an extension module to cc.ast that specifies what annotations the type checker adds.

Variable objects are used for quite a few things, including functions; see variable.h for more details. Each variable has an associated location, and that is what the grep checks. When a hit is found, the location of the reference to that variable is printed.

The Declarator case works similarly, as it has been annotated with the Variable to which the declarator refers.

One problem with this grepper is that Variable only has one location, but a variable can be declared in multiple places, and it's not always obvious which location will be stored there (look in variable.h for details). A better grepper would use one traversal to robustly find the Variable of interest using any of a variety of identification criteria, and then another traversal to identify uses. I leave this as an exercise for the interested reader.

7. Example language extension

This is for the moment a TODO item ....

8. Directions for further exploration

(outline)

cc.ast
cc_type
cc.gr (just skim it)
cc_tcheck.cc (skim)
look at the regression tester, ERROR lines, etc; learn how to add a new test
... what else?