OGRE 14.3
Object-Oriented Graphics Rendering Engine
A common question is: why should I use instancing? The big reason is performance. There can be 10x improvements or more when used correctly. As a guide, instancing is likely a good fit when:

- The same Entity (mesh) is repeated many times in the scene.
- All of those instances share the same material.
- Your bottleneck is the CPU, i.e. you are limited by the number of draw calls.

If these three requirements are all met in your game, chances are instancing is for you. There will be minimal gains when using instancing on an Entity that repeats very little, or when each instance actually has a different material; if the Entity never repeats, it could even run slower. Likewise, if the bottleneck in your game is not the CPU (i.e. it's in the GPU), instancing won't make a noticeable difference.
As explained in the previous section, instancing groups all instances into one draw call. However, this is only half the truth: instancing actually groups a certain number of instances into a batch. One batch = one draw call.
If the technique is using 80 instances per batch, then rendering 160 instances needs 2 draw calls (two batches), while rendering 180 instances needs 3 draw calls (three batches).
What is a good value for the instances-per-batch setting? That depends on a lot of factors, which you will have to profile. Normally, increasing the number should improve performance, because the system is most likely CPU-bottlenecked. However, past a certain number, trade-offs begin to show up (for example, coarser frustum culling, since a whole batch is culled as a unit, and more wasted memory when batches aren't full).
The actual value will depend a lot on the application: whether all instances are often on screen or mostly frustum culled, and whether the total number of instances can be known at production time (i.e. environment props). Normally numbers between 80 and 500 work best, but there have been cases where big values like 5000 actually improved performance.
Ogre supports 4 different instancing techniques. Unfortunately, each of them requires a different vertex shader, since their approaches are different. Their compatibility and performance also vary.
This is the same technique the old InstancedGeometry implementation used (with improvements).
Basically, it creates a large vertex buffer with many repeats of the mesh, and sends per-instance data through shader constants. Because SM 2.0 & 3.0 provide up to 256 shader constant registers, this allows approximately up to 84 instances per batch, assuming they're not skinned. Using shader constants for other purposes (e.g. lighting) also reduces this number. A skeletally animated mesh with 2 bones cuts the limit from 84 to 42 instances per batch.
The main advantage of this technique is that it's supported on a wide variety of hardware (only SM 2.0 cards are required), and the same shader can be used for both skeletally animated normal entities and instanced entities without a single change required.
Unlike the old InstancedGeometry implementation, the developer doesn't need to worry about reaching the 84-instance limit: the InstanceManager automatically takes care of splitting and creating new batches. But beware: internally, this means a smaller performance improvement. Another improvement is that vertex buffers are shared between batches, which significantly reduces GPU VRAM usage.
Vertex Shader input example:
Vertex position calculation example:
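Since the original sample listings aren't reproduced here, the following is a minimal Cg/HLSL-style sketch of the idea; the array name, its size, and the register layout are illustrative assumptions, not the actual sample shader:

```hlsl
// Sketch only: per-instance world matrices live in shader constants;
// each vertex carries the index of the instance it belongs to.
float3x4 worldMatrix3x4Array[84]; // 3 float4 registers per instance
float4x4 viewProjMatrix;

struct VSInput
{
    float4 position    : POSITION;
    float  instanceIdx : BLENDINDICES; // which instance this vertex belongs to
};

float4 main(VSInput input) : POSITION
{
    // Transform by this instance's world matrix, then by view-projection.
    float4 worldPos = float4(mul(worldMatrix3x4Array[(int)input.instanceIdx],
                                 input.position), 1.0f);
    return mul(viewProjMatrix, worldPos);
}
```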
See Examples/Instancing/RTSS/Robot for an example of how to use instancing with the RTSS (Run Time Shader System).
See material Examples/Instancing/ShaderBased for an example of how to write the vertex shader. (You can find these files in OGRE sources from before v1.12.) Files:
Instancing implementation using a vertex texture, through Vertex Texture Fetch (VTF). This implementation has the following advantages:
- Supports a huge number of instances per batch
But beware the disadvantages:
- Whether this performs well or poorly depends on the hardware: it improved performance up to 4x on an Intel Core 2 Quad X9650 with a GeForce 8600 GTS and on an Intel Core 2 Duo P7350 with an ATI Mobility Radeon HD 4650, but ran 0.75x slower on an Athlon X2 5000+ with an integrated nForce 6150 SE.
- Each BaseInstanceBatchVTF has its own texture, which occupies memory in VRAM. Approximate VRAM usage can be computed as 12 bytes * 3 * numInstances * numBones. Use the flag IM_VTFBESTFIT to avoid wasting VRAM (though it may reduce the number of instances per batch).
The material requires at least one texture unit stage named InstancingVTF.
Texture unit example:
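A sketch of what that material fragment might look like; the material name and filtering settings are illustrative, while the texture unit name InstancingVTF and binding_type vertex are the parts that matter:

```
material Examples/MyInstancedVTFMaterial
{
    technique
    {
        pass
        {
            // ... vertex/fragment program references go here ...

            // Must be named InstancingVTF so the InstanceManager
            // can bind the matrix texture to it.
            texture_unit InstancingVTF
            {
                binding_type vertex
                filtering none
            }
        }
    }
}
```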
Vertex Shader input example:
Vertex position calculation example:
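A hedged Cg/HLSL-style sketch of the idea; the sampler, the texel addressing scheme, and the extra TEXCOORD layout are illustrative assumptions, not the actual sample shader:

```hlsl
// Sketch only: the per-instance world matrix is fetched from the VTF.
// The mesh carries an extra TEXCOORD pointing at this instance's rows.
sampler2D matrixTexture : register(s0);
float4x4 viewProjMatrix;

float4 main(float4 position : POSITION,
            float4 mtxIndex : TEXCOORD1) : POSITION
{
    // Each matrix is stored as 3 consecutive float4 texels (a 3x4 matrix);
    // mtxIndex packs the UV coordinates of the three rows.
    float4 row0 = tex2Dlod(matrixTexture, float4(mtxIndex.xy, 0, 0));
    float4 row1 = tex2Dlod(matrixTexture, float4(mtxIndex.zy, 0, 0));
    float4 row2 = tex2Dlod(matrixTexture, float4(mtxIndex.wy, 0, 0));
    float3x4 worldMatrix = float3x4(row0, row1, row2);

    float4 worldPos = float4(mul(worldMatrix, position), 1.0f);
    return mul(viewProjMatrix, worldPos);
}
```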
See material Examples/Instancing/VTF for an example of how to write the vertex shader and set up the material. Files:
This is the same technique as VTF, but implemented through hardware instancing. It is probably one of the best and most flexible techniques.
The vertex shader has to be slightly different from the SW VTF version.
Texture unit example:
Vertex Shader input example:
Vertex position calculation example:
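As with software VTF, the world matrix is fetched from the texture; the difference is that the lookup coordinates arrive as per-instance data fed by the hardware instancing stream, instead of being duplicated per vertex. A hedged Cg/HLSL-style sketch (names and layout are illustrative assumptions):

```hlsl
// Sketch only: with HW VTF, the VTF lookup coordinates come from a
// per-instance stream, so vertices are not duplicated per instance.
sampler2D matrixTexture : register(s0);
float4x4 viewProjMatrix;

float4 main(float4 position : POSITION,
            float4 mtxIndex : TEXCOORD1 /* per-instance data */) : POSITION
{
    float4 row0 = tex2Dlod(matrixTexture, float4(mtxIndex.xy, 0, 0));
    float4 row1 = tex2Dlod(matrixTexture, float4(mtxIndex.zy, 0, 0));
    float4 row2 = tex2Dlod(matrixTexture, float4(mtxIndex.wy, 0, 0));
    float4 worldPos = float4(mul(float3x4(row0, row1, row2), position), 1.0f);
    return mul(viewProjMatrix, worldPos);
}
```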
See material Examples/Instancing/HW_VTF for an example of how to write the vertex shader and set up the material. Files:
LUT, which stands for Look-Up Table, is a special feature of HW VTF. It has been particularly designed for drawing large animated crowds.
The technique is a trick that works by animating a limited number of instances (e.g. 16 animations), storing them in a look-up table in the VTF, and then applying those animations to all instances uniformly, giving the appearance that all instances are independently animated when seen in large crowds.
To enable LUT, the flags passed to SceneManager::createInstanceManager must include IM_VTFBONEMATRIXLOOKUP, and HW VTF must be specified as the technique.
See material Examples/Instancing/HW_VTF_LUT. Files:
This technique requires true instancing hardware support.
Basically it creates a cloned vertex buffer from the original, with an extra buffer containing 3 additional TEXCOORDs (12 bytes) repeated as many times as the instance count. That buffer holds each instance's data.
The main advantage of this technique is that it's VERY fast, with very low memory consumption and bandwidth; but it doesn't support skeletal animation at all. It is great for particles, debris, bricks, trees, and sprites. This is one of the few (if not the only) techniques that allows culling on an individual basis, which means vertex shader work can be saved for instances that aren't in the scene or aren't seen by the camera.
Vertex Shader input example:
Vertex position calculation example:
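A hedged Cg/HLSL-style sketch of the idea (names and semantic assignments are illustrative assumptions): the extra per-instance TEXCOORDs are reassembled into a 3x4 world matrix:

```hlsl
// Sketch only: with HW Basic instancing, the per-instance world matrix
// arrives as three extra float4 TEXCOORDs from the instance data buffer.
float4x4 viewProjMatrix;

float4 main(float4 position : POSITION,
            float4 mtxRow0  : TEXCOORD1,  // per-instance data: the three
            float4 mtxRow1  : TEXCOORD2,  // rows of this instance's
            float4 mtxRow2  : TEXCOORD3) : POSITION // 3x4 world matrix
{
    float3x4 worldMatrix = float3x4(mtxRow0, mtxRow1, mtxRow2);
    float4 worldPos = float4(mul(worldMatrix, position), 1.0f);
    return mul(viewProjMatrix, worldPos);
}
```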
See Examples/Instancing/RTSS/Robot for an example of how to use instancing with the RTSS (Run Time Shader System).
See material Examples/Instancing/HWBasic for an example. (You can find these files in OGRE sources from before v1.12.) Files:
Some instancing techniques allow passing custom parameters to vertex shaders: for example, a custom colour in an RTS game to identify player units; a single value for randomly colouring vegetation; or light parameters for rendering deferred shading's light volumes (diffuse colour, specular colour, etc.).
At the time of writing, only HW Basic supports passing custom parameters; all other techniques will ignore them.[^8]
To use custom parameters, call InstanceManager::setNumCustomParams to set the number of custom parameters you will need. This number cannot be changed after the first batch has been created (i.e. after the first createInstancedEntity call). Afterwards, it's just a matter of calling InstancedEntity::setCustomParam with the value you wish to send.
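A sketch of the calls involved, using the Ogre API described above; the manager name, mesh, material, and parameter value are illustrative assumptions:

```cpp
// Sketch: custom params must be configured before any batch exists.
Ogre::InstanceManager *manager = sceneMgr->createInstanceManager(
    "UnitsManager", "unit.mesh",
    Ogre::ResourceGroupManager::AUTODETECT_RESOURCE_GROUP_NAME,
    Ogre::InstanceManager::HWInstancingBasic, 80);

// Must be called before the first createInstancedEntity().
manager->setNumCustomParams(1);

Ogre::InstancedEntity *entity = manager->createInstancedEntity(
    "Examples/Instancing/HWBasic/MyMaterial");

// Param 0 could be e.g. a per-player team colour.
entity->setCustomParam(0, Ogre::Vector4(1.0f, 0.0f, 0.0f, 1.0f));
```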
For HW Basic techniques, the vertex shader will receive the custom param in an extra TEXCOORD.
Multiple submeshes mean multiple InstanceManagers, because instancing can only batch copies of the same submesh.
Nevertheless, it is actually quite easy to support multiple submeshes. The first step is to create one InstanceManager per submesh, setting the subMeshIdx parameter to the index of the submesh you want to use:
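A sketch of that first step (mesh name, manager names, and instance count are illustrative assumptions):

```cpp
// Sketch: one InstanceManager per submesh of the same mesh.
// The last argument of createInstanceManager is subMeshIdx.
Ogre::InstanceManager *managers[2];
for (unsigned short i = 0; i < 2; ++i)
{
    managers[i] = sceneMgr->createInstanceManager(
        "CharacterManager" + Ogre::StringConverter::toString(i),
        "character.mesh",
        Ogre::ResourceGroupManager::AUTODETECT_RESOURCE_GROUP_NAME,
        Ogre::InstanceManager::HWInstancingVTF,
        80, 0, /* subMeshIdx = */ i);
}
```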
The second step is sharing the transform with one of the submeshes (which will be called the 'master', e.g. the first submesh) to improve performance and reduce RAM consumption when creating the InstancedEntities:
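A sketch of that second step, using InstancedEntity::shareTransformWith; it assumes managers[i] is the InstanceManager created for submesh i of the same mesh, and the material names are illustrative:

```cpp
// Sketch: the entity from submesh 0 acts as the master; entities
// created from the other submeshes share its transform.
Ogre::InstancedEntity *instancedEntity[2];
instancedEntity[0] = managers[0]->createInstancedEntity("MyMaterial_SubMesh0");
instancedEntity[1] = managers[1]->createInstancedEntity("MyMaterial_SubMesh1");
instancedEntity[1]->shareTransformWith(instancedEntity[0]);

// Only the master needs to be placed in the scene graph.
sceneNode->attachObject(instancedEntity[0]);
```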
Note that it is perfectly possible for each InstancedEntity based on a different submesh to use a different material. Selecting the same material won't cause the InstanceManagers to be batched together (though the RenderQueue will still try to reduce state changes, as with any normal Entity).
Because the transform is shared, animating the master InstancedEntity (in this example, instancedEntity[0]
) will cause all other slave instances to follow the same animation.
To destroy the instanced entities, use the normal procedure:
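A sketch of the destruction step, assuming instancedEntity[] holds the master and slave entities created earlier:

```cpp
// Sketch: instanced entities are destroyed through the SceneManager,
// like any other movable object.
for (int i = 0; i < 2; ++i)
    sceneMgr->destroyInstancedEntity(instancedEntity[i]);
```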
There are two kinds of fragmentation:

- "Deletion" fragmentation: after many instanced entities are created and destroyed, batches end up half-filled, so more batches (and thus draw calls) are used than necessary.
- "Culling" fragmentation: instances belonging to the same batch drift far apart across the scene, making the batch's enclosing AABB huge and frustum culling ineffective.
Defragmenting batches can dramatically improve performance. Consider the following scenario: suppose there are 50 instances per batch and 100 batches in total (which means 5000 instanced entities of the same mesh with the same material), all of them moving all the time.
Normally, Ogre first updates all instances' positions, then their AABBs; and while at it, computes for each batch the AABB that encloses all of its instances.
When frustum culling, we first cull the batches, then we cull the instances[^9] inside the batches that passed. This is the typical hierarchical culling optimization. We then upload the instance transforms to the GPU.
After moving many instances around the world, they will make each batch's enclosing AABB bigger and bigger. Eventually, every batch's AABB will be so large that wherever the camera looks, all 100 batches pass the frustum culling test, and we have to resort to culling all 5000 instances individually.
If you're creating static objects that won't move (i.e. trees), create them sorted by proximity. This helps against both types of fragmentation.
There are cases where preventing fragmentation is not possible, for example units in an RTS game. By design, all units may end up scattering and moving from one extreme of the scene to the other after hours of gameplay. Additionally, lots of units may be in an endless loop of creation and destruction, and if that loop is broken for a certain type of unit, the "deletion" kind of fragmentation can appear too.
For this reason, the function InstanceManager::defragmentBatches( bool optimizeCulling )
exists.
Using it is as simple as calling the function. The sample NewInstancing shows how to do this interactively. When optimizeCulling is true, fixing both types of fragmentation will be attempted; when false, only the "deletion" kind will be fixed.
Bear in mind that when optimizeCulling = true, the call takes significantly longer depending on the level of fragmentation, and can cause framerate spikes or even stalls. Call it sparingly, and profile to find the optimal frequency.
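A sketch of the call, assuming instanceManager points to a valid InstanceManager:

```cpp
// Sketch: periodically defragment this manager's batches.
// true  -> attempts to fix both kinds of fragmentation (slower, may spike)
// false -> fixes only the "deletion" kind
bool optimizeCulling = true;
instanceManager->defragmentBatches(optimizeCulling);
```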
For example, to modify the HW VTF vertex shader, you need to sample the additional matrices from the VTF:
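A hedged Cg/HLSL-style sketch of what those fetches look like; texelOffsetX, the texture layout, and the helper names are illustrative assumptions, not the actual sample shader:

```hlsl
// Sketch only: fetching 4 bone matrices per vertex from the VTF and
// blending them by the vertex's skinning weights (12 fetches total).
sampler2D matrixTexture : register(s0);
float texelOffsetX; // width of one texel in UV space (set from C++)

float3x4 fetchMatrix(float2 uv)
{
    // Each bone matrix occupies 3 consecutive float4 texels.
    float4 row0 = tex2Dlod(matrixTexture, float4(uv, 0, 0));
    float4 row1 = tex2Dlod(matrixTexture,
                           float4(uv.x + texelOffsetX, uv.y, 0, 0));
    float4 row2 = tex2Dlod(matrixTexture,
                           float4(uv.x + 2 * texelOffsetX, uv.y, 0, 0));
    return float3x4(row0, row1, row2);
}

float3 blendPosition(float4 position, float4 blendWeights, float2 baseUV)
{
    float3 blended = float3(0, 0, 0);
    for (int i = 0; i < 4; ++i)
    {
        float2 uv = float2(baseUV.x + i * 3 * texelOffsetX, baseUV.y);
        blended += mul(fetchMatrix(uv), position) * blendWeights[i];
    }
    return blended;
}
```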
As you can see, an HW VTF vertex shader with 4 weights per vertex needs a lot of texture fetches. Fortunately they fit the texture cache very well; nonetheless, it's something to watch out for.
Instancing is meant for rendering large numbers of objects in a scene. If you plan on rendering thousands or tens of thousands of animated objects with 4 weights per vertex, don't expect it to be fast, no matter what technique you use to draw them.
Try convincing the art department to lower the animation quality, or just use Ogre::IM_FORCEONEWEIGHT to have Ogre do the downgrade for you. There are many plugins for popular modeling packages (3DS Max, Maya, Blender) that help automate this task.
A: Your rig uses more than one weight per vertex. Either create the InstanceManager with the flag Ogre::IM_FORCEONEWEIGHT, or modify the vertex shader to support the exact number of weights per vertex needed (see previous questions).
A: The quickest way is to look at the type of the Ogre::VES_BLEND_WEIGHTS element, where VET_FLOAT<N> means N weights.
[^8]: In theory all the other techniques could implement custom parameters, but for performance reasons only HW VTF is well suited to implement it. It remains to be seen whether the parameters should be passed to the shader through the VTF or through additional TEXCOORDs.
[^9]: Only HW instancing techniques cull per instance. SW instancing techniques send all of their instances, zeroing matrices of those instances that are not in the scene.