A Spur gear for Cog

I’ve got tentative permission at my day job to work on improving Cog’s performance, specifically garbage collection.  The existing Squeak GC isn’t much to write home about; it has an interesting attempt at space optimization but is rather slow because it’s based on a pointer-reversal scan-mark-compact algorithm which, while it avoids the space overhead of a mark stack (implemented in the pointer reversal), ends up writing every field scanned three times, twice during scan-mark, and once during compaction, which is expensive.  Generation scavenging by comparison writes each field just once.  We have recent benchmark results at Cadence that show the VM spending 40% of its time in garbage collection (!!).  While this isn’t a good comparison, in my experience with the VisualWorks VM, HPS, I saw overheads in the 2% to 5% range on memory-intensive loads. In any case a good GC should be a lot cheaper than 40% of entire execution time.

The Squeak object representation also isn’t that spiffy.  It has three header sizes, from a one to a two word header. one-word header objects must be instances of up to 31 classes (the “compact classes”); there is a  bit field in the first header word and if non-zero the class of the object is in a 31-element Array off to the side.  This is a good idea but as we’ll see it doesn’t go far enough.  If the size of the object is less than 254 bytes (and not one of the compact classes), it has a two word header, the class reference taking up a word.  Large objects have a three-word header with the full size taking a word.  The parts of the VM that deal with object access (inst var access, object indexing etc) need to be insulated from this variety; it would be a really bad idea if one had to adjust an inst var offset by the header size whenever fetching the inst var of an object, so an object pointer always points at the first word of the header.  But a memory manager must enumerate objects (e.g. privately for compaction, or as a service for allInstances) and hence needs to be able to look at the word following an object and tell what kind of header its got.  So the Squeak format puts the header size in the least significant two bits of all header fields.  Hence when fetching the class of a two- or three-word header the VM must mask off the least significant two bits.

All this complexity has two main impacts on the Cog JIT.  First, the inline cache probe at the beginning of each method, the code that checks that the receiver’s class is what’s expected, is about twice as long as that in HPS because it must cope with compact classes and with stripping the least-significant two bits from a full class field.  Second, implementing basicNew or at:put: in machine code is a too tedious to contemplate; every object instantiation (including block creation) and every at:put: involves calling out of machine code into the interpreter’s run-time to use the existing Squeak VM code, which is doubly slow, slow because the switch from machine code to C involves switching stacks, and slow because the existing run-time code is already complex, given that the object representation is intrinsically complex.

So revamping the garbage collector and changing the object representation appear to offer about a factor of two in performance, in part from cutting into the 40% GC overhead, and in part from the object representation, a much more efficient implementation of at:put:, basicNew  and block creation, and a shorter simpler method prologue.  If you look at the computer language benchmark game you’ll see VisualWorks is roughly twice as fast as Squeak on Cog.

OK then, so what does this mean concretely?  I’m changing the garbage collector and the object representation, a project I’ve nicknamed Spur (geddit??).  The GC is being replaced by a generation scavenger, which is well-known technology.  The object representation is influenced by the scheme I came up with when implementing a 64-bit version of VisualWorks, and I’ll spend some time describing it here.  The main problem with a 64-bit Smalltalk object representation is that blowing a whole word on a class reference implies at least a 16-byte header, 8 bytes for the class and 8 bytes for the remaining information, and that’s a lot of memory to spend on an object header, given that statically the average Smalltalk object has about 7 inst vars (there are a lot of methods in the image), and dynamically probably a lot less (things like Point and Rectangle, which have two inst vars each, are going to be instantiated much more frequently using the IDE than methods).  And the way out is … the class index idea.  Use a field in the header to contain a class index and hold classes off to the side.  Given that Smalltalk VMs of all stripes, interpreted or compiled, contain extensive method cacheing, the slower class access isn’t an issue.  The VM uses the class index as the cache tag in the inline and global method caches, indirecting to fetch the actual class object only when method lookup in the cache fails and the class hierarchy must be searched (slow anyway, adding an extra indirection won’t be noticed), or in the class primitive.

There are knock-on advantages.  Class indexes are constant; the GC doesn’t update them when class objects are moved during garbage collection, and so they don’t need to be visited in in-line caches in machine code during garbage collection.  Class indexes are constant; when instantiating a well-known class such as Array, Message or BlockClosure, the VM doesn’t need to go look up the class object in a table, it simply uses the appropriate class index.  But, you ask, isn’t general instantiation really slow because with a class in hand, wanting to instantiate it, don’t we have to search the class table looking for the class to determine its class index?  No…

Another issue in the Squeak system, and many Smalltalk VMs that use a GC that moves objects, is the identityHash.  The system must provide a per-object hash that doesn’t change as an object moves.  In the original 16-bit Smalltalk-80 implementation there was a fixed object table containing  object headers, each with a pointer to their object body.  The body could move but the header never did and so the object’s address could serve as its hash.  But in 32-bit VisualWorks there’s a 14-bit hash field and in Squeak it’s only 11 bits, and that means lots of collisions in IdentityDictionary et al. So in my class index design there’s a much bigger identityHash field, and intentionally it is the same size (*) as the class index, and the system arranges that a class’s identityHash is also its index in the class table.  The first time a class gets instantiated or sent a message or put in an identity hash table the VM notices the class has no identityHash yet, finds an unused slot in the class table, and assigns this index as the class’s identityHash field.  So to instantiate a class the VM copies the class’s identityHash to the new instance’s classIndex field; simple. (*) in fact it needs only to be as large as the class index. If one needs header bits or a better identityHash one could reduce the size of the classIndex field and increase the identityHash field. All that’s needed is that a classIndex fit in the identityHash field.

One really good idea in the Squeak object representation is the format field.  This is a 4-bit per-object field that defines the object’s basic type, is it pointers, bits, indexable, etc, and if bytes, how many unused bytes in the object, the “odd bytes”, given that its header size is a word count.  The existing Squeak format is

ObjectMemory methods for header access 
formatOf: oop 
" 0 no fields 
1 fixed fields only (all containing pointers) 
2 indexable fields only (all containing pointers) 
3 both fixed and indexable fields (all containing pointers) 
4 both fixed and indexable weak fields (all containing pointers). 

5 unused 
6 indexable word fields only (no pointers) 
7 indexable long (64-bit) fields (only in 64-bit images) 

8-11 indexable byte fields only (no pointers) (low 2 bits are low 2 bits of size) 
12-15 compiled methods: 
# of literal oops specified in method header, 
followed by indexable bytes (same interpretation of low 2 bits as above) 
        <inline: true> 
        ^((self baseHeader: oop) >> 8) bitAnd: 16rF

What’s good here is that a lot of properties, usually implemented as single-bit flags, some of which are exclusive, are combined in a single field.  These bits would be a pointers bit, unset in byte objects, an indexable bit, 0 in e.g. Point, (and often stored in the class, from where its slow to get, slowing down at: & at:put:), an isWeak bit, set in WeakArray et al, an Ephemeron bit, and two bits for the number of odd bytes.  So instead of 5 or 6 bits we have 4, and these bits are often related during at: and at:put: (isPointers and odd bytes are both relevant).

So I’m keeping this scheme, but extending it to cope with 64-bit objects.  Another design goal of Spur is to share as much of the object representation between a 32-bit and a 64-bit implementation as possible.  In VisualWorks there’s a lot of difference between the two, 32-bits having direct class pointers, 64-bits having class indexes, and that means a complex code base that’s sometimes hard to read.  Hence the Spur format field is

Spur32BitMemoryManager methods for header access 
formatOf: objOop 
    "0 = 0 sized objects (UndefinedObject True False et al) 
     1 = non-indexable objects with inst vars (Point et al) 
     2 = indexable objects with no inst vars (Array et al) 
     3 = indexable objects with inst vars (MethodContext AdditionalMethodState et al)
     4 = weak indexable objects with inst vars (WeakArray et al)
     5 = weak non-indexable objects with inst vars (ephemerons) (Ephemeron)
     6,7,8 unused
     9 (?) 64-bit indexable 
     10 - 11 32-bit indexable 
     12 - 15 16-bit indexable 
     16 - 23 byte indexable 
     24 - 31 compiled method" 
    self flag: #endianness. "longAt: objOop + self wordSize in a big-endian version" 
    ^(self longAt: objOop) >> self formatShift bitAnd: self formatMask

Now we can define the complete Spur header.  There are two formats; an 8-byte header, common between the 32-bit and 64-bit implementations, and a 16-byte header which has the overflow size in an 8-byte field prepended to the standard 8-byte header:

SpurMemoryManager methods for header format 
headerForSlots: numSlots format: formatField classIndex: classIndex 
    "The header format in LSB is 
     MSB:    | 8: numSlots              | (on a byte boundary) 
             | 2 bits                   | 
             | 22: identityHash         | (on a word boundary) 
             | 3 bits                   | 
             | 5: format                | (on a byte boundary) 
             | 2 bits                   | 
             | 22: classIndex           | (on a word boundary) : LSB 
     The remaining bits (7) need to be used for 
     leaving 2 unused bits."
    <returnTypeC: #usqLong> 
    ^ (numSlots << self numSlotsFullShift) 
    + (formatField << self formatShift) 
    + classIndex
That’s room for 4 million classes, 4 million identityHashes, and objects with up to 1020 bytes before they need an overflow size field (2040 in the 64-bit system).  While the 4 million class limit will be breached some day, it’ll stand for a few years yet, so it shouldn’t be as alarming as Bill’s famous 64k byte remark.  For object parsing the overflow size word also contains a numSlots field.  If the numSlots field is maxed out at 255 slots, then there’s an overflow size word and its numSlots is also 255.  If the word following an object has 255 in the most significant byte then that following object has a 16-byte header, 8 bytes otherwise.
This is a work in progress so I’ve yet to write the instantiation routine, but hopefully you can see its a lot simpler than this:
    ObjectMemory methods for allocation 
    instantiateClass: classPointer indexableSize: size 
        "NOTE: This method supports the backward-compatible split instSize field of the 
        class format word. The sizeHiBits will go away and other shifts change by 2 
        when the split fields get merged in an (incompatible) image change." 
        | hash header1 header2 cClass byteSize format binc header3 hdrSize sizeHiBits bm1 classFormat | 
        <inline: false> 
        self assert: size >= 0. "'cannot have a negative indexable field count" 
        hash := self newObjectHash. 
        classFormat := self formatOfClass: classPointer. 
        "Low 2 bits are 0" 
        header1 := (classFormat bitAnd: 16r1FF00) bitOr: (hash bitAnd: HashMaskUnshifted) << HashBitsOffset. 
        header2 := classPointer. 
        sizeHiBits := (classFormat bitAnd: 16r60000) >> 9. 
        cClass := header1 bitAnd: CompactClassMask. "compact class field from format word" 
        byteSize := (classFormat bitAnd: SizeMask + Size4Bit) + sizeHiBits. 
                 "size in bytes -- low 2 bits are 0" 
        "Note this byteSize comes from the format word of the class which is pre-shifted 
                 to 4 bytes per field. Need another shift for 8 bytes per word..." 
        byteSize := byteSize << (ShiftForWord-2). 
        format := self formatOfHeader: classFormat. 
        format < 8 
                 [format = 6 
                     ifTrue: ["long32 bitmaps" 
                         bm1 := BytesPerWord-1. 
                         byteSize := byteSize + (size * 4) + bm1 bitAnd: LongSizeMask. "round up" 
                         binc := bm1 - ((size * 4) + bm1 bitAnd: bm1). "odd bytes" 
                         "extra low bit (4) for 64-bit VM goes in 4-bit (betw hdr bits and sizeBits)" 
                         header1 := header1 bitOr: (binc bitAnd: 4)] 
                     ifFalse: [byteSize := byteSize + (size * BytesPerWord) "Arrays and 64-bit bitmaps"]] 
                 ["Strings and Methods" 
                 bm1 := BytesPerWord-1. 
                 byteSize := byteSize + size + bm1 bitAnd: LongSizeMask. "round up" 
                 binc := bm1 - (size + bm1 bitAnd: bm1). "odd bytes" 
                 "low bits of byte size go in format field" 
                 header1 := header1 bitOr: (binc bitAnd: 3) << 8. 
                 "extra low bit (4) for 64-bit VM goes in 4-bit (betw hdr bits and sizeBits)" 
                 header1 := header1 bitOr: (binc bitAnd: 4)]. 
        byteSize > 255 "requires size header word/full header" 
            ifTrue: [header3 := byteSize. hdrSize := 3] 
            ifFalse: [header1 := header1 bitOr: byteSize. hdrSize := cClass = 0 ifTrue: [2] ifFalse: [1]]. 
        ^self allocate: byteSize headerSize: hdrSize h1: header1 h2: header2 h3: header3 doFill: true format: format 

    allocate: byteSize headerSize: hdrSize h1: baseHeader h2: classOop h3: extendedSize doFill: doFill format: format 
        "Allocate a new object of the given size and number of header words. (Note: byteSize already 
         includes space for the base header word.) Initialize the header fields of the new object and 
         fill the remainder of the object with a value appropriate for the format. May cause a GC" 

        | newObj remappedClassOop | 
        <inline: true> 
        <var: #i type: 'usqInt'> 
        <var: #end type: 'usqInt'> 
        "remap classOop in case GC happens during allocation" 
        hdrSize > 1 ifTrue: [self pushRemappableOop: classOop]. 
        newObj := self allocateChunk: byteSize + (hdrSize - 1 * BytesPerWord). 
        hdrSize > 1 ifTrue: [remappedClassOop := self popRemappableOop]. 

        hdrSize = 3 
            ifTrue: [self longAt: newObj put: (extendedSize bitOr: HeaderTypeSizeAndClass). 
                self longAt: newObj + BytesPerWord put: (remappedClassOop bitOr: HeaderTypeSizeAndClass). 
                self longAt: newObj + (BytesPerWord*2) put: (baseHeader bitOr: HeaderTypeSizeAndClass). 
                newObj := newObj + (BytesPerWord*2)]. 

        hdrSize = 2 
            ifTrue: [self longAt: newObj put: (remappedClassOop bitOr: HeaderTypeClass). 
                self longAt: newObj + BytesPerWord put: (baseHeader bitOr: HeaderTypeClass). 
                newObj := newObj + BytesPerWord]. 

        hdrSize = 1 
            ifTrue: [self longAt: newObj put: (baseHeader bitOr: HeaderTypeShort)]. 
        "clear new object" 
        doFill ifTrue: 
             [| fillWord end i | 
              fillWord := format <= 4 
                              ifTrue: [nilObj] "if pointers, fill with nil oop" 
                              ifFalse: [0]. 
              end := newObj + byteSize. 
              i := newObj + BytesPerWord. 
              [i < end] whileTrue: 
                  [self longAt: i put: fillWord. 
                   i := i + BytesPerWord]]. 
But will it blend?  I’ve bootstrapped from the old format to the new format so I can show you what happens to the size of the heap.  Take a guess.  There is additional overhead when compared to the existing Squeak object representation.  A further departure is that objects are always a multiple of 8 bytes, and always have at least one field past the header for forwarding, specifically for implementing lazy become:. So in the 32-bit system a zero-sized object occupies 16 bytes, 8 bytes for the header, and 4 bytes for the forwarding pointer, rounded up to 8 bytes to preserve 64-bit alignment.  In the 64-bit system it’s of course also 16 bytes, one extra 8 byte field for the forwarding pointer.  And in Spur, Characters are 30-bit immediate values, with tag pattern 2, whereas SmallIntegers are unchanged, 31-bit immediate values with tag pattern 1.   Let’s compare the sizes of the heaps; for this I’m using a Squeak4.3 image containing 342533 objects occupying 15585400 bytes, 45.5 bytes, or 11.37 words per object.
The new image contains 348104 objects, occupying 17844440 bytes.  There are fewer Characters but some extra space for the root and two pages of the sparse class table.  So there’s a significant increase in heap size of 14.5%, but not a huge increase given Spurs other advantages (*).  Let’s take a look at the overheads
| l s z o d r | 
l := s := z := o := d := 0. 
self allObjectsDo: 
    [:j| | n | 
    n := self numSlotsOf: j. 
    n odd ifTrue: [d := d + 1].                    "d is number of odd-word sized objects, 138,919" 
    n >= self numSlotsMask 
         ifTrue: [l := l + 1]                      "l is number of large objects with an overflow size field, 963" 
              [n = 0 
                   ifTrue: [z := z + 1]            "z = zero-sized objects, 10,642" 
                        [n = 1 
                             ifTrue: [o := o + 1]  "o = one-slot sized objects, including 1,2,3 & 4 byte strings, 54,502" 
                                  [s := s + 1]]]]. "s = small objects with no overflow size field, 281,997" 
r := { l. s. o. z }.                               "sum of r is total number of objects" 
r, {d}, (r, {d} collect: [:v| v * 100.0 / r sum roundTo: 0.01]), { r sum } 

=> #(963 281997 54502 10642 138919 0.28 81.01 15.66 3.06 39.91 348104)

So the overhead for the forwarding pointer is z * 8 = (10642 * 8), a percentage overhead of 54376 * 100.0 / 17844440 or 0.48%.  The overhead for rounding-up to 64-bits is d * 4 = 138919 * 4, a percentage overhead of 3.1%.  The number of zero-sized objects rounded up to make room for their forwarding pointer is 10642, an overhead of 10642 * 8 * 100 / 17844440, or 0.48%. Interestingly the number of odd slot objects is 15.7%, significantly less than the expected 50%.  Do we software engineers like powers of two or does binary computing encourage them? The saving on Characters is almost negligible; there are only 256 characters in the image I started with (no static occurrences of wide characters), that’s 768 bytes + 1032 bytes for the Character table.  So in summary, the forwarding pointer and 64-bit alignment impose a 3.6% overhead. Spur loses space because it uses 8 bytes of header for almost all objects whereas the old format’s 1-word header for small instances of the compact classes manages to squeeze fully 62.5% of objects into the small header (the Squeak 4.3 image I used has 213947 1-word header out of 342533 objects). So given how effective the old scheme is, 14.5% heap growth isn’t so bad.

(*) an earlier version of this post contained a dreadful clerical error. I mistakenly subtracted the size of an empty space from the total Spur heap size and hence claimed a -2.4% shrinkage in heap size. Apologies; forgive me. One of the hazards of web publishing is the lack of review.

I’d like to dedicate Spur to Andreas Raab, my dear friend and challenging and supportive colleague, who gave me the chance to implement Cog in the first place.  Andreas died so young of a stroke earlier this year, but wonderfully he has a son.  Kathleen may your life with Theodor be filled with joy!  Andreas I miss you.

   Send article as PDF