cc_type: Type Representation

This page documents cc_type.h and cc_type.cc, the files which make up the type representation module in Elsa. You should look at cc_type.h as you read this file.

Note that Diagram 1 of C++ Entities and Relationships puts the concepts on this page in context with the rest of the language, in particular the lookup infrastructure.

1. Introduction

The broad intent of cc_type is to represent types in a way that's easy to understand and manipulate. The abstract syntax of C/C++ type declarations is very much not amenable to either, because it shares type specifiers among declarators in a declaration, and because declarators are inside-out (see the description of Declarator in cc.ast.html). The type checker has the responsibility of interpreting the C/C++ syntax, and constructing corresponding Type objects so that subsequent analyses don't have to deal with the actual syntax.

Another goal of the type representation is general independence from syntax. Ideally, cc_type would not refer to cc.ast at all, but I haven't quite been able to achieve that. Nevertheless, it remains a goal to minimize the knowledge cc_type has about cc.ast. Types are concepts that exist independently from any particular syntax used to describe them, and I want the module dependencies to reflect that basic fact to the extent possible.

One major decision I made for this module was to translate typedefs away entirely. This means I don't have a node for some kind of arbitrary "named type", which would have to be "unwrapped" every place a type is inspected. The potential disadvantage is that I won't use the programmer's names when referring to types in error messages, but there's an easy solution to that too: every constructed type could have a field saying what name, if any, the programmer had been using as an alias for this type. That's not implemented, but I'm confident it would eliminate any advantage to retaining typedefs. (Update: It has recently been implemented, see Type::typedefAliases.)

Types are divided into two major classes: atomic types and constructed types. Atomic types (my terminology) are types atop which type constructors (like "*" or "[]") might be applied, but which themselves do not have any constructors, so cannot be further "deconstructed" (for example, by template pattern matching). They consist of built-in types like "int", enums, structs, classes and unions. I regard the aggregation of structs/classes/unions to form an atomic wrapper around their members, in part because two (e.g.) struct types are equal only if they arise from the exact same declaration; there is no structural equality for structs. Each kind of atomic type is explained in Section 2.

Constructed types, by constrast, are whatever is built on top of atomic types by applying type constructors such as "*". These types can be deconstructed by template unification

  template <class T>
  void foo(T *p) { ... }      // could pull "int" out of "int *"
and typedefs can be used to build them in pieces
  typedef int myint;
  myint const *p;             // builds "int const"
(again, in contrast to the AtomicTypes). Each type constructor is explained in Section 3.

If you're really bothered by the apparent contradiction that CompoundType is an AtomicType, think of the "struct { ... }" syntax as wrapping the members in an opaque container, thus forming an indivisble unit (i.e. an "atom"). This is of course exactly what happens from the compiler's point of view, and the basis for the terminology. Quoting from Dennis Ritchie's The Development of the C Language, section "Embryonic C":

The central notion I captured from Algol was a type structure based on atomic types (including structures), composed into arrays, pointers (references), and functions (procedures).

2. Atomic Types

Atomic Type Hierarchy

AtomicType is the root of the atomic type hierarchy. It just contains methods to figure out what kind of type it actually is.

SimpleType is represents simple types built-in to C/C++, like "int". cc_flags.h contains the definition of SimpleTypeId. Note that there is no attempt to break down primitive types along dimensions like "signed" or "long", since the set of types is not orthogonal w.r.t. those dimensions (e.g. there is no "signed float" or "long char"). If an analysis wants (e.g.) an "isSigned" function, it should provide it itself, and make the associated decisions about what "isSigned(ST_FLOAT)" returns.

There are some SimpleTypeId codes that do not correspond to primitive C/C++ types, but are used instead by the front-end for various intermediate communication purposes. For example, ST_ERROR represents the type of an expression that contained an error, and is used to try to reduce the amount of error cascading. There are also some ids used to help implement section 13.6 of the C++ standard. Analyses generally won't have to concern themselves with such oddball values, since they shouldn't be visible outside the type checker implementation.

CompoundType is certainly the most complex of all the type classes. It represents a class, struct, or union. Storage of class members is done by inheriting a Scope (cc_scope.h). Thus, CompoundType itself is mostly concerned with representing the inheritance hierarchy.

To get the name lookup semantics right, we must be able to tell when names are ambiguous and also know the inheritance path used to arrive at any base. This is complicated by the presence of virtual inheritance. Each class contains its own, independent representation of its particular inheritance DAG; this representation is a graph of BaseClassSubobj objects. A base class subobject corresponds one-to-one to some particular (contiguous) range of byte offsets in the final object's layout.

CompoundType knows how to print out its DAG to Dot format; see CompoundType::renderSubobjHierarchy. For example, in/std/3.4.5.cc yields 3.4.5.png.

EnumType represents an enum. It has a list of enumerator values, which are also added to the Env (cc_env.h) during type checking.

3. Constructed Types

Constructed Type Hierarchy

Type is the root of the constructed type hierarchy. Like AtomicType, it has methods to find out which kind of type a particular object is (roll-my-own RTTI). It also has a variety of query methods (like isVoid()), and entry points for printing types.

CVAtomicType is always the leaf of a constructed type. It just wraps an AtomicType, but possibly with const or volatile qualifiers.

PointerType represents a pointer type, possibly qualified with const or volatile.

ReferenceType represents a reference type. It has no qualifiers. Note that in the original design, PointerType was used for both pointers and references, so you might find the occasional comment incorrectly referring to that state of affairs. However, you can still treat pointers and references uniformly if you want, by using isPtrOrRef(), getCVFlags(), and getAtType().

FunctionType represents a function type. The parameters are represented as a list of Variable (variable.h) objects. Nonstatic member functions have the isMember flag set, in which case their first parameter is called "this" and has type "C cv &", where "C" is the name of the class of which the function is a member, and cv is optional const/volatile flags. FunctionType also has a flag to indicate if it accepts a variable number of arguments, and a way (ExnSpec) to represent exception specifications.

Finally, function templates have a list of template parameters. As I think about it, it's kind of strange to imagine a function type being templatized, so maybe I should have put that list someplace else (Variable?). On the other hand, maybe it will work to treat templates as having polymorphic types. Until the template implementation matures, it's not clear what the best strategy is.

4/19/04: The above description is out of date; the template parameters have indeed been moved into Variable. However the template design remains in a state of flux so I'm waiting for it to settle before thoroughly documenting it.

ArrayType represents an array type. Types with no size specified have size NO_SIZE.

PointerToMemberType represents a C++ pointer-to-member. It says which class' member it thinks it points at, the type of the referred-to thing, and some optional const/volatile qualifiers. This is used for both pointers to both member functions and member data.

Note that creating a Pointer to a Function that happens to have isMember be true is not the same thing as a PointerToMember to a Function (without isMember). The former is a concept that I think would be useful, but does not exist in C++. The latter is C++'s pointer-to-member (to a function), and has the "heterogeneous array" semantics.

4. Templates

The template design is still somewhat incomplete. I'd like to have a pass that can fully instantiate templates, and so some of this is looking forward to the existence of such a pass.

TypeVariable is used for template functions and classes to stand for a type which is unknown because it's a parameter to the template. It should point at its corresponding Variable, rather than just having a StringRef name...

TemplateParams is a list of template parameters. I pulled this out into its own class for reasons I now don't remember...

ClassTemplateInfo is intended to contain information about template instantiations. It's not used right now.

5. TypeFactory

When the type checker (cc_tcheck.cc) constructs types, it actually does so via the TypeFactory interface. This is to make it possible for someone to build annotations on top of my Types, instead of going in and mucking about in cc_type.h. It has several core functions that must be defined by a derived class, and a variety of other functions with default implementations when such an implementation is "obvious" in terms of the core functions.

The present form of TypeFactory is driven by the existence of one project that has such an annotation system. As new analyses arise that may need to customize the way types are built, I'll add new entry points to TypeFactory and modify cc_tcheck.cc to use them.

I consider the presence of the SourceLoc parameter in most of the TypeFactory functions to be a wart on the design. I don't even know what that location should be, if the type is constructed outside the context of interpreting a type denotation in the AST. The base Elsa front end itself always ignores that value, and passes SL_UNKNOWN whenever there isn't a location handy. I'd like to eventually revisit the decisions that led to the presence of those parameters.

BasicTypeFactory is an implementation of TypeFactory that just builds the types defined in cc_type.h in the obvious way.

6. BaseType and Type

While the TypeFactory mechanism makes it easy to annotate the leaf types, like FunctionType and PointerType, it doesn't offer a way to effectively annotate Type itself, because that class is in the middle of the inheritance hierarchy.

So, Type is actually split into two classes, BaseType and Type. Type inherits from BaseType, and by default, adds almost nothing to BaseType. However, clients of Elsa are allowed to replace Type's definition with one that includes additional annotations.

To replace Type's definition, make a header file that contains the new class definition (make a copy of the existing Type definition as a starting point). Then arrange for the preprocessor symbol TYPE_CLASS_FILE to be set to the name (in quotes) of the file containing the new definition.

The key requirement of class Type is that it must not allow the name "BaseType" to leak out into its interface. I do not want to have to deal with more than one name for "generic type" in analyses like the type checker. Some care is required to achieve this, as I don't want to make the inheritance private (since that would necessitate repeating all of BaseType's member declarations).

Valid HTML 4.01!