I really don't want to discourage you guys from analyzing client performance and suggesting changes to improve it, but I really cannot help but think that y'all are looking in the wrong places. For instance:
romovs wrote:Yes too many unnecessary object instantiations in critical routines.
Object creation in HotSpot is generally extremely cheap. I have at times even tested and benchmarked converting some very common object creations into pooled or mutable objects instead, and it has made no difference whatsoever. Which is as expected, really, as HotSpot implements object allocation as little more than the increment of a register. Young-gen GCs (or GCs overall, for that matter) also represent almost no part of the total CPU time expended by the client.
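To illustrate the kind of comparison described, here is a rough sketch (the `Coord` class and loop counts are hypothetical, and a crude timing loop like this is no substitute for a proper harness like JMH):

```java
// Rough illustration: allocating short-lived objects vs. reusing one mutable
// instance. On HotSpot, the allocating version is typically just as fast,
// since a young-gen allocation is little more than a pointer bump.
public class AllocBench {
    static final class Coord {
        double x, y;
        Coord(double x, double y) { this.x = x; this.y = y; }
    }

    static double allocating(int n) {
        double acc = 0;
        for (int i = 0; i < n; i++) {
            Coord c = new Coord(i, i + 1);   // fresh short-lived object each iteration
            acc += c.x + c.y;
        }
        return acc;
    }

    static double pooled(int n) {
        Coord c = new Coord(0, 0);           // one reused ("pooled") instance
        double acc = 0;
        for (int i = 0; i < n; i++) {
            c.x = i; c.y = i + 1;
            acc += c.x + c.y;
        }
        return acc;
    }

    public static void main(String[] args) {
        int n = 10_000_000;
        long t0 = System.nanoTime();
        double a = allocating(n);
        long t1 = System.nanoTime();
        double b = pooled(n);
        long t2 = System.nanoTime();
        System.out.printf("alloc: %d ms, pooled: %d ms, same result: %b%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a == b);
    }
}
```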
aghmed wrote:coord.add(c2).mul(c3).mul(c4)
Even with object allocation being cheap, I'm pretty sure the JIT will inline those calls, convert the non-escaping instances into stack allocations, and in the end even move those into register-only values.
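A hypothetical immutable `Coord` in the style of the quoted chain makes the point concrete. None of the intermediate objects escape the method, which is exactly the situation HotSpot's escape analysis handles (the class and values here are made up for illustration):

```java
// Immutable coordinate type mirroring the quoted coord.add(c2).mul(c3).mul(c4)
// style. After inlining, the temporaries never escape chained(), so the JIT
// can scalar-replace them: their fields live in registers, with no heap
// allocation in the compiled code.
public final class Coord {
    public final double x, y;
    public Coord(double x, double y) { this.x = x; this.y = y; }
    public Coord add(Coord o) { return new Coord(x + o.x, y + o.y); }
    public Coord mul(Coord o) { return new Coord(x * o.x, y * o.y); }

    public static double chained(Coord c1, Coord c2, Coord c3, Coord c4) {
        // The two intermediate Coords and the final one are all local to
        // this method; only two doubles actually leave it.
        Coord r = c1.add(c2).mul(c3).mul(c4);
        return r.x + r.y;
    }

    public static void main(String[] args) {
        Coord c1 = new Coord(1, 2), c2 = new Coord(3, 4);
        Coord c3 = new Coord(5, 6), c4 = new Coord(7, 8);
        System.out.println(chained(c1, c2, c3, c4));
    }
}
```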
aghmed wrote:Every single openGL request is converted to a wannabe delegate "Command" that is added to a pool of commands and computed.
For those who don't know, the BGL and its Commands are used to offload the actual calling of OpenGL driver routines to a second thread, so it is a part of parallelization. However, when I implemented it, at the stage where I had the BGL but not yet the secondary thread to do the dispatch, I ran it all in one thread, and even then it was still faster (by about 10%) than doing the OpenGL calls directly in place instead of allocating Command instances. My hypothesis on why this is so is that bunching the preparation code and the OpenGL calls together, each for themselves, simply improves cache locality.
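The overall shape of that design can be sketched as a producer/consumer queue. To be clear, `GLCommand`, `runDemo`, and the queue sizes below are invented names for illustration, not the actual BGL API; a counter stands in for real driver calls:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the command-queue idea: the "UI" thread records commands instead
// of issuing GL calls directly, and a dedicated dispatch thread drains the
// queue and runs them. A poison-pill command shuts the dispatcher down.
public class CommandQueueDemo {
    interface GLCommand { void run(); }
    static final GLCommand POISON = () -> { };

    static int runDemo(int n) {
        BlockingQueue<GLCommand> queue = new ArrayBlockingQueue<>(1024);
        AtomicInteger executed = new AtomicInteger();

        // Dispatch thread: the only thread that would talk to the GL driver.
        Thread dispatcher = new Thread(() -> {
            try {
                GLCommand cmd;
                while ((cmd = queue.take()) != POISON)
                    cmd.run();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        dispatcher.start();

        try {
            // "UI" thread: generates commands; a counter stands in for GL calls.
            for (int i = 0; i < n; i++)
                queue.put(executed::incrementAndGet);
            queue.put(POISON);
            dispatcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
        return executed.get();
    }

    public static void main(String[] args) {
        System.out.println("executed " + runDemo(100) + " commands");
    }
}
```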
It is also useful to observe that the thread that actually does the OpenGL command dispatch is only very rarely the bottleneck anyway. At least nine times out of ten, the bottleneck is rather the UI thread, which generates the commands. Therefore, even if the BGL dispatch loop may be said to be slow in some way, optimizing it would make no difference -- it would just increase the amount of time that thread spends waiting for the next frame to dispatch.
When I tried this benchmark, I found not only that it was mostly GC-limited, but also that it didn't follow reasonable practice for JIT warmup. Changing it to use only 1,000,000 commands, and instead repeating the test 1,000 times, showed that the average time taken to allocate and then dispatch a single command was only on the order of 25 ns.
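The warmup methodology can be sketched like so (the workload, repetition counts, and method names are all hypothetical stand-ins; the point is only to discard the early runs, during which the interpreter and JIT compilation dominate, before averaging):

```java
// Sketch of repeat-and-discard-warmup benchmarking: run the workload many
// times, ignore the first few runs while the JIT is still compiling, and
// report per-operation time from the remaining runs only.
public class WarmupBench {
    static long work(int n) {            // stand-in for allocating + dispatching n commands
        long acc = 0;
        for (int i = 0; i < n; i++)
            acc += i ^ (acc << 1);
        return acc;
    }

    static double avgNanosPerOp(int reps, int n, int warmup) {
        double total = 0;
        int measured = 0;
        for (int r = 0; r < reps; r++) {
            long t0 = System.nanoTime();
            work(n);
            long t1 = System.nanoTime();
            if (r >= warmup) {           // only count runs after the hot path is compiled
                total += (t1 - t0);
                measured++;
            }
        }
        return total / (measured * (double) n);
    }

    public static void main(String[] args) {
        System.out.printf("%.1f ns/op%n", avgNanosPerOp(1000, 1_000_000, 100));
    }
}
```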
To offer my theory on why the client performs as it does and how to improve it, I'm pretty sure it is simply because the client treats all objects (graphical in-game objects, that is) as completely dynamic and treats them all with equal dignity. I'm fairly sure the main reason most modern games are so much faster is that their engines know which objects are static and can fast-path them appropriately, whereas the Haven client assumes that any object may change at any time and does complete setup of it from scratch every cycle. The reasonable way to fix this would be to give the client a notion of static objects and allow it to cache rendering information about them from frame to frame. Optimally, it could perhaps even save a stand-alone BGL list for each group of such objects and just resubmit it for dispatch every frame.
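A minimal sketch of that static-object fast path, under the assumption that per-frame setup is the expensive part (`SceneObject`, `cachedCommands`, and friends are invented names, not the client's actual API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of caching rendering commands for static objects: build the command
// list once and resubmit it every frame, instead of regenerating it from
// scratch per cycle as a fully dynamic object would require.
public class StaticCacheDemo {
    interface GLCommand { void run(); }

    static int buildCount = 0;           // counts how often expensive setup runs

    static final class SceneObject {
        final boolean isStatic;
        List<GLCommand> cachedCommands;  // filled once for static objects

        SceneObject(boolean isStatic) { this.isStatic = isStatic; }

        List<GLCommand> commands() {
            if (isStatic && cachedCommands != null)
                return cachedCommands;   // fast path: resubmit the saved list
            List<GLCommand> cmds = build();
            if (isStatic)
                cachedCommands = cmds;
            return cmds;
        }

        private List<GLCommand> build() {
            buildCount++;                // the expensive per-frame setup happens here
            List<GLCommand> cmds = new ArrayList<>();
            cmds.add(() -> { /* bind buffers, set uniforms, draw... */ });
            return cmds;
        }
    }

    static int renderFrames(SceneObject obj, int frames) {
        buildCount = 0;
        for (int f = 0; f < frames; f++)
            for (GLCommand c : obj.commands())
                c.run();
        return buildCount;               // setups performed over all frames
    }

    public static void main(String[] args) {
        System.out.println("static rebuilds:  " + renderFrames(new SceneObject(true), 60));
        System.out.println("dynamic rebuilds: " + renderFrames(new SceneObject(false), 60));
    }
}
```

Over 60 frames, the static object pays for setup once while the dynamic one pays every frame, which is the whole argument for giving the client a notion of static objects.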