Lazy Become and a Partial Read Barrier for Spur
One of the stated design goals for Spur is to support pinned objects via lazy become. This is one of the less well-thought-out parts of the design. If I can make lazy become work it’ll be (as I hope to convince you in this article) a significant improvement over the current Squeak VM’s become implementation. But it must work, and to that end I thought I’d write a post describing my current design of the scheme, and allow y’all to criticise it, either shooting it down in flames, saving me the effort of tilting at windmills, or helping fill in the blanks and offer improvements to help the scheme’s success. So let’s have a tilt at describing become:.
Become is Smalltalk-speak for an operation that changes all references to one object into references to another. There are two becomes. two-way, where two objects exchange identities, and one-way where references to a first object are replaced by references to a second object, and the first object becomes inaccessible, to be swept away by the garbage collector. This operation has traditionally been used to grow collections in Smalltalk, and to update live instances when a class definition changes, adding or removing variables. It can also be used to replace a movable copy of an object with a non-movable copy in some more suitable space, for example effectively morphing a movable object in new space with an immobile one in old space. Being able to pin objects down is invaluable in a Foreign Function Interface (FFI) where one wants to pass storage to other components.
The traditional implementation of become in the earliest Smalltalk-80 implementations was based around a two-part object representation. The system had a table of object headers, all of the same size, and each header contained some information used by the garbage collector (contents contain pointers or not, etc) and a pointer to the body of the object, the body containing the object’s class, its size, and its fields. The two-way become operation is simple with this representation; simply swap the pointers to the object bodies, plus whatever other header bits are relevant. Here there is effectively a read barrier on every class and instance variable access, through the pointer, so-called “object table indirection”. The VisualWorks memory manager still uses this scheme, updated to work with a generational collector.
Subsequently, Digitalk’s Smalltalk/V used a flat object representation without object table indirection, and hence provided a one-way become which scanned through memory, replacing any reference to the first object with a reference to the second.
The existing Squeak/Cog garbage collector has a flat object representation, and supports both two-way and one-way bulk becomes. The input to the become primitive is two arrays of objects, and the become is pair-wise, performing either a two-way or a one-way become on each pair of objects in the two arrays. This is implemented above the VM’s compaction system which stops the world, allocates some forwarding blocks, replaces the headers of objects to move or become, with pointers to their forwarding blocks, which contain their final position after become or compaction, scans memory following object references through forwarding blocks updating pointers, and then, if compacting, moving objects, finally discarding the forwarding blocks.
All approaches support grow, and instance migration (although IIRC Smalltalk/V didn’t migrate instances on class redefinition), but while the cost of the object indirection version is O(1), with hidden costs in object table indirection, the flat implementation is O(n) in the size of the heap, and in the age of 64-bit machines that’s a problem.
An alternative implementation, oft-used in Lisp systems, is to add a read barrier to all object access, and mark objects as forwarders. This can be used to implement lazy copying garbage collection where objects are copied from one semi-space to another in parallel to the main program (the “mutator”). To become, or move an object one replaces the object’s header or first field with a forwarding pointer to the desired target or copy in a new location, marking the “corpse” as forwarded. The program checks the forwarded flag on each access. If there is hardware support, as in a Lisp machine, this can work well. But without hardware support, and like the object table representation, it has costs, slowing down program execution due to the scattering of forwarding checks and forwarding pointer follows throughout program execution.
The Spur Conjecture
Smalltalk provides strict object encapsulation; the only way to access the state of an object is to send it a message. OK, bald-faced lies are very popular these days, but with Smalltalk there’s more than a grain of truth here. The only violations of this principle are though certain primitives. The add primitives, for example, be it on SmallIntegers (fixnums) or Large integers (bignums) access the state of both the receiver and the argument in computing the result. One might think that instVarAt: and instVarAt:put:, which can access the instance variables of any object from the outside, violate encapsulation, but one still has to send the message to the object one is accessing. Mirror primitives, however, do violate encapsulation, allowing one to reach inside an object without sending a message to it, passing the object to be accessed as an argument to the mirror primitive. The up side is that mirrors provide security, they are capabilities that only certain parts of a system are trusted with; instVarAt: and instVarAt:put: declare open season on internals and are thankfully used with discretion in Smalltalk systems.
Given strict encapsulation it seems to me that we can rely on message sending to follow forwarding pointers. (almost) Every message send in Smalltalk involves some form of check of the class of the receiver. In the abstract, a message is activated by fetching a receiver object’s class, and searching the method dictionaries in the class’s hierarchy until a matching method is found. In practice, the process is optimized using method lookup caches. Every send involves fetching either the object’s class, or a unique class identifier (e.g. an index into a class table) from the object and comparing it against a possibly matching tag in a cache. If they match, the right method is in the cache. If they don’t match a full lookup is needed. By arranging that forwarding objects have their class tags changed, the message send cache probe becomes a forwarding object trap. Whenever a message is sent to a forwarding object the cache probe will fail and the failure path can follow the forwarding pointer.
The scheme then implements pinning and become using forwarding pointers. The old space collector will not move pinned objects, and to avoid pinned objects filling up new space, an unpinned young object is pinned by allocating a pinned copy in old space and marking the young object as a forwarding object pointing to the copy. Two-way become is implemented by either swapping contents if the two objects have the same size, or allocating two copies, and forwarding the originals to the exchanged copies. One-way become is implemented by marking the first object as forwarded, pointing to the second object. This seems great in principle, but there are a number of pitfalls, not just the primitive argument one mentioned above.
My Spur conjecture is that using a message send trap is an effective approach because the pitfalls are limited and manageable. So what are the pitfalls?
Falling With Style
The first pitfall is that Smalltalk accesses two other objects far more than the receiver, which are the current method and the active context. Smalltalk-80’s execution model executes bytecodes and references literals stored in the current CompiledMethod object. In most good Smalltalk implementations, methods are compiled on-the-fly to machine code and the machine code is executed. In fact, Cog is a hybrid interpreter+JIT, interpreting essentially only at start-up. So the becoming of methods is really only an issue in the interpreter. Smalltalk-80’s execution model also provides access to the underlying activation record, a context, and in a fast Smalltalk implementation (and Cog’s not too shabby) contexts are essentially virtual, created on-demand only when the program asks to see them; the VM’s actual activation records are more-or-less conventional stack frames, and contexts are proxy objects for them. And while message sending is heavily optimized, when a method cache probe does fail the class hierarchy of the receiver is searched, and any object there-in could have been becomed.
The more-or-less conventional stack frames are housed not on the machine stack but in a stack zone of quite limited size, organized as a set of pages. The stack zone acts much like a young generation for contexts, storing the most recently active N stack frames. While the number of pages in the stack zone is a settable parameter, the zone’s size is typically less than 100 kb. So it is practicable and quick to scan the stack zone and follow forwarding pointers after a become operation. We can know that the receiver in a stack frame is never a forwarding pointer and hence not need a read barrier when reading its instance variables. We can know that the method in an interpreted frame is never a forwarding pointer and hence not need a read barrier when interpreting bytecodes or accessing literals. However, a literal itself may be a forwarded object, and so, for example, we may have to follow the message selector on a message probe miss. We must scan the first-level method lookup cache and remove forwarded entries there-in; again the first-level method lookup cache is quite small, perhaps 4k entries.
To avoid checking class fetch (where the class table is indexed by an instance’s class index to locate its class object for method lookup) we can scan the class table after a become operation. There can be up to 4M classes in Spur, but this is still a tiny fraction of a large heap – ignoring page density, faulting in 4M pages could be expensive ;-). But in the common case the class table contains a few thousand classes, most of them in the first few pages of the table, and that’s a tiny amount of memory to scan. There is a wrinkle; the class table contains only classes that have been entered into the table, either by being instantiated, or by being entered into a hash table (this because a class’s index into the table is its identityHash). But method lookup traverses a class’s superclass chain, and e.g. abstract superclasses on the chain may not have been entered into the table. So the scan for forwarding pointers needs to follow the superclass chain of each class in the table until a class with an index is found; classes with indices are in the table so will be seen in scanning the table.
In Squeak super sends are implemented using a method’s method class association; this is its last literal and is an association from class name to class object of the class that owns the method, the method’s method class (quite different from the method’s class). A superclass send fetches the value of the method class association and then the superclass of the method class, and starts the lookup in this superclass. To avoid a read barrier on accessing the superclass, the stack zone scan needs to follow the method class association to the class.
Since the stack zone houses only some activation records, return from the base stack frame in a stack page may attempt to return out of the stack zone into a context. At this point the VM “faults in” the context into a new stack page (which I term marrying it with its stack frame). This process must now check for forwarding pointers throughout the context and the method’s method class association. Since marrying a context is an expensive operation this adds hopefully negligible overhead.
Access to literals other than the method class association are of three kinds, message selectors, global variable associations, and ordinary literals. Let’s consider these in turn.
- If a message selector is a forwarding pointer then, provided that we scan the method cache and follow there-in, the forwarded message selector will cause a cache miss and the message selector can be followed prior to the class-hierarchy lookup.
- Smalltalk implements global variable access by using as a literal the Association for the variable that is owned by some global dictionary (a key,value pair e.g. of class name to class object, e.g. in Smalltalk or a class’s classPool). push/store/popLiteralVariable all fetch a literal, and either read or write the literal’s value field. The fetch of the literal needs an explicit check (otherwise we would have to scan all literals in all methods in the stack zone, and the entire method on return, and global variables are relatively rare; in my work image 8.7% of literals are globals).
- ordinary literals are simply objects pushed onto the stack. These, like the referents of the receiver’s inst vars may be forwarded. No read barrier is needed here; access will be handled further down the line in method activation.
With the above approaches all bases appear to be covered except for primitives (have I forgotten anything else?). And the primitive issue, which we introduced at the beginning of this section, is access to sub-state of the receiver and/or access to arguments. These may be forwarding pointers.
First of all, the message send trap catches access to the receiver itself; this cannot be a forwarding pointer, and if the receiver is a flat non-pointer array such as a Float, a ByteArray or a String (all implemented as arrays of bits), the state cannot be forwarding pointers.
Secondly, not all state-accessing primitives are affected. e.g. in at:put: (almost *) no checks are required because the receiver will have been caught by the message send trap, the index is a SmallInteger (an immediate, and immediates can’t be becomed), and the argument is either an immediate Character (in String>>at:put:) or a possibly forwarded object that will simply be stored into the array; i.e. the last argument’s state is inspected only if it is an immediate. (*) the exception is 64-bit indexable and 32-bit indexable arrays, these are floats & bitmaps which could take LargeIntegers whose contents are copied into the relevant field). But e.g. in beCursor extensive checks are required because the primitive inspects the state of the receiver to some depth, accessing a couple of form instances, and a point that are sub-state of the Cursor object.
How should the design solve the issue of state access in primitives? One approach would be an explicit call in the primitive, to follow the pointers in the receiver, and in accessing the arguments from the stack. This can be made convenient via providing something like ensureNoForwardingPointersIn: obj toDepth: n, which in beCursor’s case would look like interpreter ensureNoForwardingPointersIn: cursorObj toDepth: 3 (the offset point of the mask form). Another approach would be to put an explicit read barrier in store/fetchPointer:ofObject:[withValue:] et al, but provide an additional api (e.g. store/fetchPointer:ofNonForwardedObject:[withValue:] et al) and use it in the VM’s internals. The former approach is error-prone, requiring extensive testing, but the latter approach is expensive, potentially ugly, touching nearly all of the core VM code. It would appear that one of these two approaches must be chosen. I favour the former, but I could be quite mad.
Finally, the Squeak VM cognoscenti amongst you will know that the Squeak VMs support primitive plugins that can be dynamically-loaded and interface to the VM via interpreterProxy. interpreterProxy’s API is a simple place to introduce read barriers as required.
OK, please respond with conversation, contradiction, criticism, positive or negative. OK, please respond with conversation, contradiction, criticism, positive or negative.