Joel Spolsky writes about the relatively poor performance of the Ruby interpreter:
Without knowing much about the implementation of Ruby, I would guess that the biggest issue is around late binding and especially duck typing, which prevents type inference or strong typing, which means that function calls will always be slow because you can never get something compiled down to the point where a function call is just a single CALL instruction (on x86)… you always have to be exploring the object, possibly even scanning a hash table to find the function you want to call. Calling methods on objects is really, really, really, really common, especially in pure OO languages, and that bit of the code needs to get down to at least a CALL indirect on the CPU for Ruby to offer the same performance as languages where the type of the object can be determined at compile time and therefore the instruction where you’re jumping to can be gotten either at compile time (like in C) or with a single indirection through a single vtable (like in C++).
Joel’s wrong. Yes, Ruby is slow, and yes, that has to do with Ruby’s naive implementation of method dispatch, but no, it’s simply not true that you can “never get something compiled down” to the point where it’s fast enough. A proper implementation of a duck typed language will get method calls down to a single jump plus a single compare, 95% of the time. There’s no vtable indirection needed, which means that it’s generally faster than C++, because branch prediction works and the pipeline doesn’t stall.
How does that work? Well, even with duck typing, almost all method call sites have a single implementation that’s used way more than any other (say, a default superclass implementation that’s occasionally overridden by subclasses). So a proper compilation of that method call will involve first a JMP to the most likely implementation, followed by a CMP to see if you’re actually in the right place. Only if you’re not do you have to go through the full hash table lookup Joel is describing, and that happens so rarely, the extra time it takes gets practically lost in the noise.
How do you know what the most commonly used implementation is? You don’t need static analysis, just runtime profiling: the first few hundred times you do a method dispatch, store statistics about the results at the call site. Eventually, you can replace that call with the optimized version described above.
This technique isn’t secret. In fact, Sun has just today open sourced the Strongtalk VM, which, if I’m not mistaken, contains a nice implementation of exactly what I’m describing. The thing that’s most interesting about Strongtalk is that it does allow optional static type annotations, but the optimizer completely ignores them: your code runs exactly as fast if you duck type everything as if you statically type everything. That’s not because the implementors were lazy - their goal was to produce the fastest system possible, it’s just that their dynamic implementation was already fast enough that the static information didn’t help.
As it happens, the Strongtalk VM would make an excellent basis for the next generation Ruby VM. Given that Sun still employs some of the original Strongtalk engineers, and has just hired the JRuby guys, maybe they can make that happen?
[Update: I was in a rush and didn’t have any references on hand, but some ended up in the comments anyway. For those who are interested in exploring this deeper, the original papers on this came out of the Self project; this is probably the most relevant one: http://research.sun.com/self/papers/pics.html. Note that it’s from 15 years ago, which is why some of us get saddened by how little the perception of dynamic language performance has changed in that time.]
Comments