I’m delighted to say that Qwaq has taken me on to write a fast Croquet VM and that the VM is to be released under the Qwaq open source licence (an MIT license). I’m going to blog about the VM here as I implement it. The blog is a chance for me to record design decisions as I go along, receive better ideas from you, dear reader, and to hand out the whitewash.
This VM will have a dynamic translator, or JIT for short. It will dynamically compile Smalltalk bytecodes to machine code transparently to the programmer, and execute this machine code instead of interpreting bytecode. Initially I’m aiming at performance equivalent to the VisualWorks VM, HPS. This VM should execute pure Smalltalk code some 10 to 20 times faster than the current Squeak VM. Subsequently I hope to do adaptive optimization and exceed VisualWorks’ performance significantly. I expect to be able to release the first fast version within a year.
The VM is called Cog after the Honda advert. Fast VMs tend to have something of the Heath Robinson or Rube Goldberg about them. But one of my rules in Cog (rules are meant to be broken) is to do the simplest thing that could possibly work (TSTTCPW). My principles for Cog, in the sense of fundamental propositions governing my behaviour, are that
- any changes in the image to accommodate the fast VM will not slow down the standard Croquet VM (the interpreter)
- any changes in the image to accommodate the fast VM will not increase overall footprint
- the fast VM will be compatible with the Hydra VM
- there will be some form of source code compatibility for Slang plugins so they can function both in the interpreter and in the fast VM
I want Cog to be used by as many people as possible, but I realise there are those interested in small machines or in VM experimentation for whom the interpreter will still be a better choice. I hope the principles above will avoid a schism and mean the Squeak community can still use a common image format and preserve image portability.
That said, I do intend to change the image format somewhat. I’ve already implemented closures. These, when implemented appropriately, have a significant impact on the performance of a JIT VM. This has meant a slight change in the bytecode set; gone are the 6 experimental bytecodes 139 through 143. I’ll blog on closures in a subsequent post quite soon.
Hydra is an interesting approach to concurrency, and something Qwaq is interested in. BTW, I think Hydra would better be called String, strings being composed of threads or fibers. String, like a cog, is typically a small part of a larger whole.
For the detail-oriented let me dive down a level and say a little more about the project, especially preparatory changes that pave the way for a fast VM. I apologize for the density. I’m interested in knowing what level of detail I should go into. Obviously people who want to collaborate on Cog will want a high level of detail. But if the architecture of a fast VM is of more general interest then I need plenty of feedback to help me explain things at the right level of detail. Please feel free to comment on this!
OK then, some details. Both TSTTCPW and the Principles imply keeping the existing CompiledMethod format. In Squeak, as in Blue Book Smalltalk-80, CompiledMethod is an anomalous hybrid, its first part being object references used mainly for literals, and its second part being raw bytes used mainly for bytecodes. This is a good choice for an interpreter since both literals and bytecodes can be fetched from the same object, but it causes minor complications in the garbage collector and gets in the way of adding instance variables to CompiledMethod and hence subclassing. I’ll blog on why I’m keeping the format and how to get round its limitations soon.
The existing bytecode set is all very well but inflexible. It has pressing limits on the size of blocks (1023 bytes of bytecode) and the number of literals (255). Being designed for a 16-bit system, it misses some compact encoding opportunities, such as a bytecode that pushes a given SmallInteger other than -1, 0, 1 & 2. Pushing 3 requires a minimum of 1 byte for the push literal bytecode plus 4 bytes for the literal slot holding the object 3. A two-byte bytecode could push any value from -128 to 127 in only 2 bytes, and be faster. Luckily, migrating to a new bytecode encoding is quite easy. Most opcodes – a name I’ll use for an abstract operation such as pushInstVar: – occur in multiple bytecodes – a name I’ll use for a specific encoding such as 0-15, 0000iiii, Push Receiver Variable #iiii. For all short-form encodings there is an equivalent long-form, and one can modify the compiler to generate only the long forms, allowing the ranges allocated to short-form encodings to be reused. I’ve already recompiled a 3.8 image to long-form.
One of the most important optimizations in a fast VM is that of context-to-stack mapping. Conceptually every method or block activation in Smalltalk is represented by a context object, each of which has its own small stack holding its temporaries and intermediate results. Each context points to its caller via a sender slot. This has lots of advantages, such as ease of implementing the debugger, implementing an exception system, being able to implement “exotic” control structures such as coroutines and backtracking, and persisting processes, all with no VM support beyond unwind-protect. The downside is that a naive implementation incurs considerable overhead. Each send involves allocating a new context and moving the receiver and arguments from the caller to the callee context. Each return involves (eventually) reclaiming the returned-from context, unless of course something has kept a reference to it. Compared to stack organization in conventional languages, contexts are sloooow.
The idea in context-to-stack mapping (you’ve guessed it, details in a subsequent post) is to house method and block activations on a conventional stack, hidden in the VM and invisible to the Smalltalk programmer. Sends look more like conventional calls, passing arguments on the stack in the conventional way, the callee directly accessing the arguments the caller pushed onto the stack. The VM then creates contexts only when needed. To the Smalltalk programmer contexts still exist, and have the same useful semantics as ever, but the overheads for the common case (send and return) are much reduced. While context-to-stack mapping is essential to a fast VM that still provides contexts, it is also useful in an interpreter. Andreas Raab suggested I implement context-to-stack mapping in the interpreter as a milestone on the way to the JIT VM. I think this is a great idea. It should provide a decent speed-up to the Squeak interpreter and help motivate things like the closure implementation.
So here, in rough order, is the plan:
- restructure the bytecode compiler to decouple parse nodes from specific bytecode encodings (complete)
- replace non-reentrant BlueBook blocks with closures using the restructured compiler (~ 75% complete, currently working on temp names in the debugger)
- implement context-to-stack mapping in the interpreter (to be released Septemberish)
- implement a JIT to replace the interpreter on x86 (to be released Aprilish)
- design a bytecode set that uses prefix bytecodes to eliminate limits and provides encodings that provide benefits for 32 and 64-bit images
- provide the ability to subclass and add/remove instance variables to CompiledMethod and subclasses while retaining the compact interpreter-friendly hybrid format
- target the JIT to other ISAs such as ARM, THUMB, PowerPC and x86-64
- general VM improvements, such as
  - add per-object immutability support, work I’ve already done for Squeak at Cadence, and previously for the VisualWorks VM
  - move garbage collection out of allocation and into the primitive failure code for new, new:, basicNew, basicNew: et al, so that the VM doesn’t have to deal with moving pointers all over the place
  - add primitive error codes so the reason for a primitive’s failure is communicated