2005-08-19 15:46 XML Serialization ----------------- Daniel S. Wilkerson Elsa provides Xml serialization and de-serialization for the AST classes and the Type system classes (which include Scope, Variable, templates etc.) **** The meaning of serialization Elsa provides strong guarantees as to the meaning of serialization; this is not a best-effort feature. More specifically, the following diagrams commute. These diagrams are actually tested in checkxml.mk which contains optional tests that are not run by default but will run if you type "make checkxml". In particular, diagram C is currently tested on 714 test and thus is considered to be very reliable. Note that the later the serialization comes, the more information it must preserve and thus the more difficult it is to make the diagram commute. In diagram B below, "print" means pretty-print. Note that "parse" really means "parse and then typecheck to disambiguate the AST and then discard the derived types". That is, we do not test serializing an ambiguous "ASForest". diagram B: parse ------------------------> typecheck -> print \ / -> toXml -> fromXml - In diagram C below, "print" means debug-print, which Scott says contains strictly more information than pretty-printing. diagram C: parse -> typecheck ------------------------> print \ / -> toXml -> fromXml - There are plans to make diagram D, below, commute, but for that to happen we have to finish deciding exactly what it means 1) to de-serialize a lowered AST with some of the template constructs missing, and 2) to print a lowered AST (should it print the template instantiations without the primary in such a way that it would compile and produce a program with the same semantics?). diagram D: parse -> typecheck -> lower ------------------------> print \ / -> toXml -> fromXml - **** How serialization works The Elsa AST classes are maintained in Scott's AST language in .ast files and are generated by the astgen tool. Generic traverse() methods are generated on each AST class for use with the Visitor design pattern; the Visitor class is also generated. AST Xml serialization is done as a Visitor using this pattern. All of this auto-generation is necessitated by the fact that there are over 140 AST classes; it also ensures uniformity. The Elsa Type-system classes, by which we also mean Scope, Variable, and template related classes, are serialized with a special-purpose serialization walk rather than a Visitor design pattern. There is a generic type-system traverse() but it is not used; a special-purpose walk is more straightforward for this purpose. The AST and Type graphs are serialized by separate modules. However, the AST and Type graphs also point into one another. Thus the two modules call into one another at those points, and the tags for the two sets of objects interleave with one another in the output. * Mapping objects to tags All classes map to a tag of the same name that is used to serialize the state of the class. Each member of a class maps to an attribute of its tag. The state of "simple" members, such as a bool, int, or string, are fully captured in the attribute value. The state of "complex" members, such as an object, cannot fit into an attribute string, so we employ a layer of indirection which Scott calls the "Rat's Nest Scheme" as follows. Given a parent pointing to a child: 1) in the parent tag for the attribute pointing to the child a unique id value for the child is written out; 2) the child is serialized as a child-tag nested within the parent tag; 3) the child tag gets a unique "_id" field so that upon de-serialization it can be matched up with the parent field pointing at it. Since in an OO language potentially any object can be pointed to, in practice all tags corresponding objects get an "_id". See below for more on the non-trivial task of generating unique ids. The state of "container" members, such as lists or maps, is dealt with by simply promoting this particular "static instance" of a container to its own tag "class": that is, the tag name is made up of 1) the parent class and 2) the member name of the container in that parent. See below for more on tag naming convention. Containers being classes gives rise to the need for "item" members: anonymous members of the new container "classes". That is, if we just serialize each member of a container, we might hit an object that has been serialized before and therefore will be skipped (see below); there would then be no record at all that it is in this container. Therefore we wrap members of list and map classes with "Item" tags; they just hold a pointer to the contained object in case it doesn't actually get serialized there. These "item" tags are the only ones that do not get their own "_id" tag as during parsing their single attribute is just stored directly into their container parent; they don't have a separate existence in the OO world so no one ever points to them. Circularities in the object graph to not pose a problem: All *pointers* to objects are *always* serialized as an attribute, however an *object* is only serialized *once*. Note that to simplify things, all tags are container tags whether they point to any complex objects or not. **** How de-serialization works Separate modules are also used for de-serializing the AST and Type graphs. However recall that their tags are mixed in a fine-grained way in the Xml. This situation is handled by the use of a design pattern I call: "non-deterministic virtual-dispatch". We take the de-serializers for the AST and for the Types and register them both with the general-purpose class XmlReader. During parsing when a tag or attribute is encountered XmlReader asks both of the registered readers to handle it in turn. Each reader 1) either handles it or not and then 2) returns a boolean saying whether it was able to: that is, we "non-deterministically" do virtual dispatch to both registered reader "subclasses" at once. It is an error if neither can handle it, and the dispatch is really done serially and short-circuits at the first call that says it handled the tag. The consequence is that parsers for multiple Xml sub-languages can be written separately and their tags interleaved. The de-serializer is a stack machine. When we de-serialize a tag, an object is constructed to represent it and pushed onto the stack. Every tag has a special "_id" attribute and when an object is de-serialized it is filed in the id2obj hashtable under its id. When we encounter a close tag, we check that it matches the object on the top of the stack and pop. An attribute is always parsed in the context of the object on the top of the stack. If it is a simple type, we just write it into the appropriate field of the top object. If it is an object id, we create an UnsatLink object to record 1) the id, 2) a pointer to the member of the top object that should be updated to point at the referent object having that id when it is found later during link satisfaction, see below, 3) the type of the pointer so we can deal with up-casting problems that arise with multiple inheritance (see below), and 4) whether this parent/child relationship is a pointer or is one of embedding. When the tags have been de-serialized we satisfy all of the dangling links as follows. For each UnsatLink we 1) look up its id in the id2obj hashtable. If the parent/child relationship is a pointer (as recorded earlier) we 2a1) do an up-cast if needed from the de-serialized object type to the link type, 2a2) do a pointer assignment satisfying the link. If the parent/child relationship is one of embedding we 2b) call operator=() to initialize the object from the temporary one created on the stack (which is no longer needed). Note that there is an alternative possible design for the temporary representation of unsatisfied links during de-serialization. If you create an UnsatLink object for each unsatisfied link it is going to fragment your heap even if you delete them, it is slow to make them all etc. When the parent points to the child rather than embedding it, the unsatisfied link can be simply written as an int into the still-unused pointer to the child. To distinguish it from a real pointer, recall that real points have at least two low-order zero bits; just set them to something else. There is an additional restriction: there are only 3 other possibilities on a 32-bit machine, so only up to three identity prefixes (see below) can be stored this way. When an object is de-serialized we can now just look at the top of the stack and search through every pointer member looking for an id that matches that of the object we just de-serialized and replace it with the real pointer. This scheme reduces the load on the list of UnsatLink-s as state above, and also on the id2obj hashtable. Given the memory locality, it is probably more cache friendly as well. It is currently unimplemented. **** Identity An object needs a unique id both so 1) we can make sure we don't serialize it twice by "checking it off" in some hashtable when we serialize it, and 2) so we can serialize separately a) the reference to an object and b) the object itself. This second feature is needed so that we can deal with circularities in the object graph, but it also makes any graph, even a tree, easier to serialize because we can separate 1) serializing the header node with its references to its children from 2) serializing the children. The consequence is that we can now use a Visitor design pattern to serialize rather than a walk. This design is essential to using a generic lowered traversal to serialize the lowered AST rather than a special-purpose lowered-serializing walk. Such a special-purpose design would be error prone and not a good factoring since both serialization and generating a lowered view of the AST/Types are complex. It turns out that a unique id is harder to come up with than you might thing. The heap address of an object is not a unique id! 1) There are circumstances where more than one object can have the same heap address. 2) There are circumstances where one object can have more than one heap address. As an exercise you might want to come up with an example of each situation before reading further. Two objects with the same heap address: The situation where more than one object can share the same heap address is with inheritance and embedding. If a class A has a member B that 1) comes first in the list of A's members, 2) is embedded, not pointed to, then A and B will share the same heap address. The solution to this is to break symmetry with a system of id-prefixes. The prefix of an object is just a function of its class and we choose prefixes such that no two classes with the same prefix ever embed within each other, even transitively(!); now the ids are unique. Prefixes are documented below. The situation of two objects sharing the same heap address also arises with inheritance, as usually in C++ the superclass is implemented as an anonymous embedded member of the subclass. With single inheritance though, there is an easy trick: never serialize the superclass, just serialize its members as if they were the members of the subclass (which is the illusion that the OO name lookup rules give you anyway); do the reverse during de-serialization. Now, when a pointer points to the subclass as if it were the superclass (the whole point of inheritance) the fact that they share the same heap address is a feature, not a bug. Two heap addresses for the same object: With multiple inheritance, an object can have more than one heap address: if an object A has superclasses B and C, then a B-pointer to A and a C-pointer to A do *not* have the same value as raw ints. The solution has two parts. 1) During serialization, we are sure to always attempt to downcast any B or C to an A and render it as an A if we can so that 1) information is not lost and that 2) the ids in the pointer attribute of the parent and the _id attribute of the child will match. 2) During de-serialization, when we de-serialize the pointer to a B or C, we record that it was a B or C pointer. When we match up the pointer with the de-serialized object later, we up-cast the C object to an A or B before setting the pointer. Recall that since serialization is not type-safe and we are not using the real type-system to manipulate these objects so we have to record what casts are genuinely correct as a separate hand-rolled run-time-type-system as we do this. **** Container non-uniformity Containers within Elsa come in several templatized classes which are in turn instantiated with several container classes. We don't want to make this externally visible in the Xml schema as other analysis tools may have different external representations. We therefore only maintain two kinds of containers externally: a "List", sequence of objects, and a "NameMap", a map from strings to objects. Note that the de-serialization handles this situation by using one kind of list for de-serializing lists, an ASTList, and one kind of name-map for de-serializing NameMaps, a StringRefMap. This one-class-fits-all approach during de-serialization is particularly important for the problematic FakeLists which exist only in the type-system and have no distinct existence separate from their contents. It is also helpful for lists that do not have a constant time append and would have to be prepended to and then reversed. All of this could be handled without the use of ASTList intermediaries at the expense of more complexity in the de-serialization code. However, the situation is more complex than that: when a list tag is initially de-serialized, we construct an ASTList template instantiation appropriate to the kind of objects that will eventually be contained, such as an ASTList. We do not want to special-purpose the code for appending to a list as new list items are encountered, though it could be done. We therefore exploit our internal knowledge of the implementation of ASTList that its contents are pointed to and not embedded. Since all pointers are the same size (on my machine anyway) this makes a diagram commute that should not according to the C++ type system: we simply cast all ASTLists to ASTList and treat them as such for purposes of appending a new element. Recall that there is a final conversion of the list into the actual type that it will be after serialization is complete, which may not even be an ASTList. Something similar is done for name maps. Note that there is a separate problem related to containers in that the ephemeral "item" tags mentioned above which wrap their contents do not map to an object that has permanent existence. When an item close tag is encountered, we simply use the id2obj hashtable right then to look up the pointed to object and insert it into the container; the item object on the stack can then be deleted. (Note that at the *close* item tag, the contained object must have been de-serialized already, either immediately before or before that at some earlier time.) This process prevents item objects from participating in the final linker satisfaction process, which, given their ephemeral existence, would be unnatural. However this scheme has one unfortunate consequence: this is the only place where the id2obj hashtable has lookups called on it before linker satisfaction time. Since there are *no* insertions into the hashtable once linker satisfaction starts if there were also *no* lookups at all before linker satisfaction starts we could employ a faster algorithm for hashtable construction: 1) while inserting into the table, just maintain a list, not a proper hashtable, 2) just at the point where the insertions stop and the lookups start, sort. **** Conventions Tag names for classes are just the class name. For nested classes, we use the fully qualified name and replace the '::' with an underscore. Tag names for containers the following three things concatenated with underscores 1) the kind (List, NameMap, Map), 2) the name of the class that has the container as a member, and 3) the name of the member of that class that is the container. For example the tag of this member class CompoundType { const ObjList bases; } is // Type system list 'class' convention You might wonder why it is not something bottom up instead, which might more naturally make for a re-usable tag, like this: // AST system list 'class' convention We do not do this because two different classes can have a list of BaseClasses that needs to be handled differently, so the de-serialization depends on the containing context, which also contains implicitly the information that it is a list of BaseClasses. Each object has an _id attribute which is unique. Note that the id is basically the heap address prefixed by a letter code; the letter code is to break symmetry for object with the same heap address, such as a class with a first member that is embedded or a non-class such as a FakeList node. Neither the address nor the prefix has no semantics so another program rendering out XML in this schema need not use the same method, but any method for generating unique ids will work. The prefixes we use are "AST" for AST nodes and the rest are given at the top of cc_type_xml.cc at the block of calls to the "identity" and "identityTempl" macros. Each container has a tag that wraps contained items in case the item has been previously serialized and therefore may not appear. They are as follows. Note that they do not correspond to objects constructed internally and therefore do not have an '_id' field. Their unique metadata status is indicated by the leading underscore, similarly to the '_id' field itself for objects. <_List_Item item=id> <_NameMap_Item name= item=id>