Optimizing script language performance with custom memory allocators

Last weekend I explored script language execution performance, specifically the memory allocation side of things, and I would like to share my findings.

Script languages and memory usage

As you probably know, scripting languages (most of them at least, like Python, Lua, etc.) tend to make a huge number of small allocations on the heap. Almost everything is stored on the heap, and if you care about performance, you start to feel homesick for your beloved C stack! Anyway, nothing comes for free, and a scripting language has to take something from you in exchange for all the goods it gives you back. So the best you can do is make sure that you have the best memory allocator for the job.

Doing too many small allocations and releases on the heap can create memory fragmentation, along with all the evil that comes with it. The common approach is to create a specialized memory allocator that serves small, fixed-size blocks of memory to the scripting language, carved from a bigger chunk of memory reserved from the system. This is common in all “realtime” and intensive applications like games, and something I have done many times to gain performance.
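To make the idea concrete, here is a minimal sketch of such a fixed-size block pool. It is not the allocator used in this post; the class name `FixedPool` and its interface are made up for illustration, and it assumes single-threaded use and objects that need no more than pointer alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical fixed-size block pool: one big chunk is reserved up front
// and carved into equal blocks that are linked through a free list.
class FixedPool {
public:
    FixedPool(std::size_t blockSize, std::size_t blockCount)
        : blockSize_(roundUp(blockSize)), blockCount_(blockCount) {
        chunk_ = static_cast<unsigned char*>(std::malloc(blockSize_ * blockCount_));
        assert(chunk_ != nullptr);
        // Push blocks in reverse so the first allocation returns block 0
        // and consecutive allocations are adjacent in memory.
        for (std::size_t i = blockCount_; i-- > 0;) {
            Node* n = reinterpret_cast<Node*>(chunk_ + i * blockSize_);
            n->next = free_;
            free_ = n;
        }
    }
    ~FixedPool() { std::free(chunk_); }
    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;

    // O(1): pop the head of the free list; nullptr when the pool is full.
    void* allocate() {
        if (free_ == nullptr) return nullptr;
        Node* n = free_;
        free_ = n->next;
        return n;
    }

    // O(1): push the block back onto the free list.
    void release(void* p) {
        assert(owns(p));
        Node* n = static_cast<Node*>(p);
        n->next = free_;
        free_ = n;
    }

    bool owns(const void* p) const {
        const std::uintptr_t a = reinterpret_cast<std::uintptr_t>(p);
        const std::uintptr_t lo = reinterpret_cast<std::uintptr_t>(chunk_);
        return a >= lo && a < lo + blockSize_ * blockCount_;
    }

    std::size_t blockSize() const { return blockSize_; }

private:
    struct Node { Node* next; };

    // A free block must be able to hold the free-list link, so sizes are
    // rounded up to a multiple of the pointer size.
    static std::size_t roundUp(std::size_t n) {
        const std::size_t a = sizeof(Node);
        if (n < a) n = a;
        return (n + a - 1) / a * a;
    }

    std::size_t blockSize_;
    std::size_t blockCount_;
    unsigned char* chunk_ = nullptr;
    Node* free_ = nullptr;
};
```

Because freed blocks are reused in place and never split or merged, a pool like this cannot fragment internally. The price is that every request is served with the full block size.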

Can’t beat the standard malloc

What I discovered with my latest attempt is that it has become quite hard to beat the GNU implementation of malloc(), something that used to be easy when you focused on a specialized case (e.g. small blocks of memory). Not that you can’t do better if you try hard, but at this point the malloc() implementation is already super-fast for 99.9% of desktop applications. Rest assured that you will not be able to do much better. However, that is not the case for embedded devices, which don’t share the same virtual memory benefits as desktop computers.

My hand-tuned specialized allocator for small blocks of memory (<= 256 bytes) was not more than 1% faster than the native malloc() on OS X 10.6. However, on the iPhone the same allocator was twice as fast as the native malloc()! Since the target was the iPhone from the beginning, that seemed like a big win! However, when I set up a small benchmark in the scripting environment that allocated game engine objects and released them again in various patterns, the results were disappointing. Using my specialized (and twice as fast) allocator improved execution speed by only about 5% in a memory-intensive benchmark, and in some tests it was even slower! That was odd and, most of all, not good!

Why I was failing

After some inspection and tests that made it less likely that I was doing something really stupid, I narrowed down the cause.

In most cases when you use a scripting language, you have some classes defined in C++ that you instantiate from the scripting language. Take for example a 3D vector class “CVector3” defined in C++. When you instantiate it in the script language you get two allocations: one in the scripting language, which allocates the “proxy” object, and one in the C++ environment. When you give the scripting language a new allocator for its allocations, you only “optimize” the first one. The one in C++ still goes through the system default allocator.

And since you optimize half of the allocations, you expect to get half the performance boost… well… wrong. It turns out that you can even be slower this way. The secret here is the CPU cache. By doing the above, you end up with two memory blocks that are usually accessed together but sit far apart in memory. This can really hurt performance on a device with slow memory like the iPhone.

The solution

The solution was of course to use the same allocator on the C++ side by overriding the “new” operator of the class. This placed the blocks of memory allocated on the script side close to the blocks allocated on the C++ side. This way, accessing the object touches only one part of memory and gives nice cache hits. Performance went up by 30%, which was nice and expected.
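A minimal sketch of that arrangement follows. Only the name `CVector3` comes from this post; `SmallPool`, `scriptPool()`, `ScriptProxy` and the two helper functions are hypothetical stand-ins for whatever the real scripting runtime and allocator provide, and the pool is deliberately simplistic (32-byte slots, single-threaded).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical shared small-block pool: fixed 32-byte slots in one array,
// with a fallback to malloc() for anything that does not fit.
struct SmallPool {
    static constexpr std::size_t kSlot = 32;
    static constexpr std::size_t kCount = 1024;
    alignas(std::max_align_t) unsigned char storage[kSlot * kCount];
    void* freeList = nullptr;
    std::size_t used = 0;  // slots handed out from fresh storage so far

    void* allocate(std::size_t size) {
        if (size > kSlot) return std::malloc(size);
        if (freeList != nullptr) {
            void* p = freeList;
            freeList = *static_cast<void**>(p);
            return p;
        }
        if (used == kCount) return std::malloc(size);
        return storage + kSlot * used++;
    }
    bool owns(const void* p) const {
        const std::uintptr_t a = reinterpret_cast<std::uintptr_t>(p);
        const std::uintptr_t lo = reinterpret_cast<std::uintptr_t>(storage);
        return a >= lo && a < lo + sizeof(storage);
    }
    void release(void* p) {
        if (p == nullptr) return;
        if (!owns(p)) { std::free(p); return; }
        *static_cast<void**>(p) = freeList;
        freeList = p;
    }
};

// One pool instance shared by the script side and the C++ side.
SmallPool& scriptPool() {
    static SmallPool pool;
    return pool;
}

// The C++ class routes its own allocations through the shared pool.
class CVector3 {
public:
    float x = 0.0f, y = 0.0f, z = 0.0f;

    static void* operator new(std::size_t size) {
        void* p = scriptPool().allocate(size);
        if (p == nullptr) throw std::bad_alloc();
        return p;
    }
    static void operator delete(void* p) noexcept { scriptPool().release(p); }
};

// Hypothetical proxy object that a scripting runtime would allocate.
struct ScriptProxy {
    CVector3* native;
    int refCount;
};

// What "instantiate a CVector3 from script" boils down to: two allocations,
// both now served by the same pool and therefore close together.
ScriptProxy* createVectorFromScript() {
    void* mem = scriptPool().allocate(sizeof(ScriptProxy));
    if (mem == nullptr) throw std::bad_alloc();
    return new (mem) ScriptProxy{new CVector3, 1};
}

void destroyVectorFromScript(ScriptProxy* proxy) {
    delete proxy->native;         // goes through CVector3::operator delete
    scriptPool().release(proxy);  // proxy is trivially destructible
}
```

With a fresh pool, the proxy and the native object land in neighbouring slots, which is the locality effect described above. Without the class-level `operator new`, the second allocation would come from the default allocator and could end up anywhere.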

One other interesting thing that I found from this is that, on the iPhone, if I just override the “new” operator of a class and make it allocate the memory with plain malloc() and don’t use my allocator at all, the system is again faster!

This is probably due to the fact that “new” does not go through plain malloc() (I didn’t bother to check), as the scripting language environment does. So the allocated blocks end up in different arenas in different parts of memory, and performance is lost for the same reason as above!

So, keep your related allocations close together when crossing the language barrier!

  • Can you give more details about your small object allocator (is it in any way inspired by the allocator in Modern C++ Design)? What about the rest of the objects that are not related to your script classes: do you allocate them with new or with some custom allocator principle (linear allocators, Doug Lea, etc.), and what about destructors? The problem with the iPhone, and small devices generally, is that you cannot count on reference inc/dec because you have no control over your memory. Generally the trend is to allocate a big chunk of memory and use it for everything, and if you are tempted to use the STL then you have to provide your own allocator. I believe that since you know your assets you can estimate your memory budget from the beginning; the downside(?) is that you have to avoid Objective-C. Last, why use classes for everything? I mean, sometimes memory-aligned PODs are much faster and more cache friendly. Bottom line is that iPhone game programming is very close to console development: not much memory, not much graphics processing power. We have to stop thinking that we are programming a PC; a mobile device is a different sport. Finally, try searching for data-driven programming, you’ll find some interesting stuff.

  • I basically keep a stack of fixed allocators for incrementally sized small blocks (4, 8, 12, etc.) and allocate from that in O(1) time. For the release I do a “twist” on the pointer address to find the fixed allocator in use, again in O(1) time (the allocators’ chunks for 4, 8, 12, etc. follow each other in memory for that purpose). This wastes some memory, but I can tune how much I want to spend.

    All small game object allocations go through this allocator now. The rest is fed to the system allocator, but that is mostly done during the loading phase. All assets needed are preloaded at the beginning of a “scene” and never touched again until the scene unloads. If that turns out to create fragmentation when many scene loads/unloads happen after long play, I will deal with it then.

    I really enjoy development on iPhone lately… it makes everything interesting again! 🙂
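One possible reading of the scheme described in the reply above, sketched in C++. This is not the actual allocator: the name `SizeClassAllocator`, the 64 KB region per size class, and the use of a plain division as the address “twist” are all assumptions made for illustration. It is single-threaded, and blocks are only as aligned as their size class allows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical size-class allocator: one region per block size
// (4, 8, 12, ... 256 bytes), with all regions laid out back to back.
class SizeClassAllocator {
public:
    static constexpr std::size_t kStep = 4;
    static constexpr std::size_t kMaxSize = 256;
    static constexpr std::size_t kClasses = kMaxSize / kStep;
    static constexpr std::size_t kRegionBytes = 64 * 1024;  // tunable, assumed
    static constexpr std::uint32_t kNone = 0xFFFFFFFFu;

    SizeClassAllocator() {
        arena_ = static_cast<unsigned char*>(std::malloc(kClasses * kRegionBytes));
        assert(arena_ != nullptr);
        for (std::size_t c = 0; c < kClasses; ++c) {
            freeHead_[c] = kNone;
            bump_[c] = 0;
        }
    }
    ~SizeClassAllocator() { std::free(arena_); }
    SizeClassAllocator(const SizeClassAllocator&) = delete;
    SizeClassAllocator& operator=(const SizeClassAllocator&) = delete;

    // O(1): pick the size class, then pop its free list or bump its region.
    void* allocate(std::size_t size) {
        if (size > kMaxSize) return std::malloc(size);  // not a small block
        const std::size_t c = (size == 0) ? 0 : (size - 1) / kStep;
        const std::size_t blockSize = (c + 1) * kStep;
        unsigned char* region = arena_ + c * kRegionBytes;
        if (freeHead_[c] != kNone) {
            unsigned char* p = region + freeHead_[c];
            std::memcpy(&freeHead_[c], p, sizeof(std::uint32_t));
            return p;
        }
        if (bump_[c] + blockSize <= kRegionBytes) {
            unsigned char* p = region + bump_[c];
            bump_[c] += static_cast<std::uint32_t>(blockSize);
            return p;
        }
        return std::malloc(size);  // region exhausted: fall back
    }

    // O(1): the region index is recovered from the address alone.
    void release(void* p) {
        if (p == nullptr) return;
        if (!owns(p)) { std::free(p); return; }
        const std::size_t offset = static_cast<unsigned char*>(p) - arena_;
        const std::size_t c = offset / kRegionBytes;
        const std::uint32_t inRegion = static_cast<std::uint32_t>(offset % kRegionBytes);
        // The freed block stores the offset of the next free block.
        std::memcpy(p, &freeHead_[c], sizeof(std::uint32_t));
        freeHead_[c] = inRegion;
    }

    bool owns(const void* p) const {
        const std::uintptr_t a = reinterpret_cast<std::uintptr_t>(p);
        const std::uintptr_t lo = reinterpret_cast<std::uintptr_t>(arena_);
        return a >= lo && a < lo + kClasses * kRegionBytes;
    }

    // Block size of the class that serves p (p must be owned by the arena).
    std::size_t blockSizeOf(const void* p) const {
        assert(owns(p));
        const std::size_t offset = static_cast<const unsigned char*>(p) - arena_;
        return (offset / kRegionBytes + 1) * kStep;
    }

private:
    unsigned char* arena_ = nullptr;
    std::uint32_t freeHead_[kClasses];  // offset of first free block per class
    std::uint32_t bump_[kClasses];      // bytes of fresh region handed out
};
```

The free list stores 32-bit offsets instead of pointers so that even a 4-byte block can hold the link. The memory trade-off mentioned in the reply shows up here as `kRegionBytes`: every size class reserves a full region whether it is used or not.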

  • I saw a link to this article on Google and really enjoyed it. Thank you!
