OGRE-Next
3.0.0
Object-Oriented Graphics Rendering Engine
The main change has been a full refactor of textures.
The main purpose of 2.2 is a complete overhaul of the Texture system, with a focus on lowering GPU RAM consumption, background texture streaming, asynchronous GPU to CPU texture transfers, and reducing out-of-GPU-memory scenarios (which are relatively easy to run into if your project is medium-to-large sized).
Ogre 2.2 is much more mobile friendly. Metal introduced the concepts of “load” and “store” actions, and we follow that paradigm because it’s easy to implement and understand.
First we need to explain how mobile TBDR GPUs work (**T**ile-**B**ased **D**eferred **R**enderers). In a regular immediate GPU (any Desktop GPU), the GPU just processes and draws triangles to the screen in the order they’re submitted, writing results to RAM, and reading from RAM if it needs to. Run the vertex shader, then the pixel shader, go on to the next triangle. The process is slightly more complex because there’s a lot of parallelization going on (i.e. multiple triangles worked on at the same time), but the overall scheme of things is that desktop GPUs process things in order.
TBDRs work differently: they process all the vertices first (i.e. run all of the vertex shaders), and later go through each tile (usually 8×8 pixels), find which triangles touch that tile, sort them front to back (unless alpha testing or alpha blending is used), and then run the pixel shaders on all the triangles and pixels. They then proceed to the next tile. This has the following advantages:
The main disadvantage is that it does not scale well to a large number of vertices (since the GPU must store all of the processed vertices somewhere). There’s a performance cliff past a certain vertex count: once you exceed that threshold, your framerate drops very quickly with every vertex you add.
This is not usually a problem on mobile because, well… nobody expects a phone to process 4 million vertices or more per frame. You can also make up for it by using improved pixel shaders (because of the Early Z you get for free).
In TBDRs, each tile is a self-contained unit of work that never flushes the cache from start to end until all the work has been done (unless the GPU has run out of space because we’ve submitted too much work, but let’s not delve into that).
If you want a more in-depth explanation, read A look at the PowerVR graphics architecture: Tile-based rendering and Understanding PowerVR Series5XT: PowerVR, TBDR and architecture efficiency.
In an immediate renderer, clearing the colour and depth buffers means we instruct the GPU to basically memset the whole texture to a specific value. And then we render to it.
In TBDRs, this is inefficient; as the memset will store a value to RAM, that later needs to be read from RAM. TBDRs can:
The Metal RenderSystem in Ogre 2.1 tried to merge clears alongside draws as much as possible, but it didn’t always get it right; and it glitched when doing complex MSAA rendering.
Now in Ogre 2.2 you can explicitly specify what you want to do. For load actions you can do:
For store actions you also get:
This gives you a lot of power and control over how mobile GPUs control their caches in order to achieve maximum performance. But be careful: If you set a load or store action to “DontCare” and then you do need the values, then you’ll end up reading garbage every frame (uninitialized values), which can result in glitches.
These semantics can also help on Desktop. Whenever we can, we emulate this behavior (to make porting to mobile easier), but we also tell the API about this information whenever the DX11 and GL APIs allow us. This can mostly help with SLI and Crossfire.
Explicit MSAA finally arrived in Ogre 2.2, and thanks to load and store actions you have a lot of control over what happens with MSAA and when, which can result in high quality rendering by making MSAA a first class citizen.
There have been numerous other MSAA changes. In Ogre 2.1, MRT + MSAA did not work correctly except on D3D11 (which meant the SSR sample only worked with MSAA on D3D11). Now it works everywhere.
Another change, for example, is that in Ogre 2.1 all render textures in the compositor defaulted to using MSAA if the main target was using MSAA. In Ogre 2.2 we default to not using MSAA unless told otherwise. We found out most textures do not need MSAA because they’re postprocessing FXs or similar; for them MSAA only wastes memory and performance on useless resolves. Therefore only a few render textures actually need MSAA. This saves a lot of GPU RAM and some performance.
The impact from this change can vary a lot based on how you were using Ogre:
The following classes, which belonged to the "old" texturing system, were removed or are not functional and scheduled for removal:
They were instead replaced by the following classes:
There are a few things that need to be clarified up front:
The following table summarizes old and new classes:
Task | 2.1 | 2.2 |
---|---|---|
Use a texture | Texture | TextureGpu |
Render to texture | RenderTarget / RenderTexture | TextureGpu |
Access a cubemap face or mipmap | HardwarePixelBuffer | TextureGpu |
Upload data (CPU → GPU) | Map HardwarePixelBuffer | StagingTexture |
Download data (GPU → CPU) | Map HardwarePixelBuffer | AsyncTextureTicket |
Setup MRT (Multiple Render Target) | RenderTexture + MultiRenderTarget | RenderPassDescriptor |
Creating / destroying textures. Loading from file | TextureManager | TextureGpuManager |
Dealing with Hlms texture arrays | HlmsTextureManager | TextureGpuManager |
Managing a window (events, resizing, etc) | RenderWindow | Window |
Rendering to a window | RenderWindow | TextureGpu |
Dealing with depth buffers | DepthBuffer | TextureGpu |
You may have noticed that 'TextureGpu' is now repeated a lot. That is because in 2.1 the functionality was mainly fragmented between 3 classes:
This was madness because these distinctions were applied inconsistently and often made no sense, e.g. a RenderTexture is often drawn to and then used as a texture source, or we want to render to several mips (which are represented as separate RenderTargets). This meant we had to constantly walk up and down the hierarchy to get the information we needed.
Now in Ogre 2.2, all of that is condensed into one class:
This doesn't mean that TextureGpu is an overgrown God Class. The fragmentation into 3 was a bad idea to begin with.
PixelFormat is deprecated and will be removed. You should use PixelFormatGpu instead. The way to read the format is the following:
The boolean "hwGamma" that usually came alongside the pixel format is gone. Now the gamma version is part of the format. For example PFG_RGBA8_UNORM and PFG_RGBA8_UNORM_SRGB are both the same format except one is gamma corrected (sRGB).
Here's a table with common translations:
Old | New |
---|---|
PF_L8 | PFG_R8_UNORM |
PF_L16 | PFG_R16_UNORM |
PF_A8B8G8R8 PF_BYTE_BGR PF_BYTE_BGRA | PFG_RGBA8_UNORM PFG_RGBA8_UNORM_SRGB |
PF_A8R8G8B8 PF_BYTE_RGB PF_BYTE_RGBA | PFG_RGBA8_UNORM (*) PFG_RGBA8_UNORM_SRGB(*) |
PF_DXT1 PF_DXT2 | PFG_BC1_UNORM PFG_BC1_UNORM_SRGB |
PF_DXT3 PF_DXT4 | PFG_BC2_UNORM PFG_BC2_UNORM_SRGB |
PF_DXT5 | PFG_BC3_UNORM PFG_BC3_UNORM_SRGB |
(*) The correct format would be PFG_BGRA8_UNORM, PFG_BGRX8_UNORM, PFG_BGRA8_UNORM_SRGB and PFG_BGRX8_UNORM_SRGB. However, avoid these because:
How do I toggle gamma correction on and off dynamically?
Create the texture with the flag TextureFlags::Reinterpretable and then use DescriptorSetTexture2 to interpret it with a different format (advanced users).
Compute job passes use DescriptorSetTexture2 internally by default and support format reinterpretation out of the box.
Where are PF_L8 and PF_L16?
These no longer exist as they did not exist in D3D11 & Metal and were being emulated in GL. Use PFG_R8_UNORM & PFG_R16_UNORM instead, which return the data in the red channel, as opposed to L8 which in GL returned data in all 3 RGB channels (but in D3D11 & Metal, only returned data in the red channel...). Use HlmsUnlitDatablock::setTextureSwizzle to control how the channels are routed to your liking.
The following snippet loads a texture from file:
Note the texture is not loaded yet, so it consumes very little RAM, and functions like getWidth() and getHeight() won't return valid values. To load it, call:
This will begin loading in the background, in a secondary thread managed by TextureGpuManager.
For example, calling texture->getWidth() immediately after scheduleTransitionTo is a race condition: if the secondary thread hasn't loaded the texture yet, you will get garbage values (probably 0 if the texture is new); if the thread managed to load the texture in that time, you'll get the real value.
To safely call texture->getWidth() you can call:
This will block the calling thread until the metadata has been loaded. Metadata means:
But the actual texture contents aren't loaded yet. For that we have:
This will block the calling thread until the texture is fully loaded.
Last, but not least, you can also call:
This will block the calling thread until all pending textures are loaded.
The following snippet creates a texture that will be filled by hand (e.g. manually loading from file yourself, procedural generation, etc):
You don't necessarily need to have one StagingTexture per TextureGpu for uploads. For example you could have four 1024x1024 TextureGpus and request one StagingTexture of 2048x2048 or one of 1024x1024x4; map it four times and perform four uploads. In pseudo code:
Please watch out for three things:
1. mapRegion can fail if the StagingTexture has run out of space; check TextureBox::data on failure. This space gets restored once stopMapRegion is called.
2. Use StagingTexture::supportsFormat to check if the parameters are compatible with the upload you're trying to do. However, mapRegion may still fail if it has run out of space. If supportsFormat returns false, it means mapRegion will always fail. If supportsFormat returns true, it means mapRegion may or may not succeed.
3. Once you've called StagingTexture::upload, calling StagingTexture::startMapRegion again will stall until the copy is done. You can call StagingTexture::uploadWillStall to know if the StagingTexture is ready or not.
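The out-of-space behaviour can be illustrated with a small standalone model. This is only a sketch: `MockStagingArea`, its linear allocation scheme and `mapManyRegions` are hypothetical and are not Ogre's StagingTexture implementation; a null return stands in for a null TextureBox::data, and RGBA8 (4 bytes per pixel) is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a staging buffer: a fixed byte budget that hands
// out regions linearly. This is NOT Ogre's implementation.
class MockStagingArea
{
public:
    explicit MockStagingArea( std::size_t capacityBytes ) :
        mStorage( capacityBytes ), mUsed( 0u )
    {
    }

    // Returns nullptr when the request does not fit in the remaining space
    // (standing in for a null TextureBox::data). Rows are tightly packed here.
    std::uint8_t *mapRegion( std::uint32_t width, std::uint32_t height,
                             std::uint32_t bytesPerPixel )
    {
        const std::size_t needed = std::size_t( width ) * height * bytesPerPixel;
        if( needed > mStorage.size() - mUsed )
            return nullptr;
        std::uint8_t *ptr = mStorage.data() + mUsed;
        mUsed += needed;
        return ptr;
    }

    // Makes the whole budget available again, like the space that is restored
    // once mapping has been stopped.
    void reset() { mUsed = 0u; }

private:
    std::vector<std::uint8_t> mStorage;
    std::size_t mUsed;
};

// Tries to map `count` square RGBA8 regions and returns how many succeeded.
inline int mapManyRegions( MockStagingArea &area, int count, std::uint32_t size )
{
    assert( count >= 0 );
    int mapped = 0;
    for( int i = 0; i < count; ++i )
    {
        if( area.mapRegion( size, size, 4u ) != nullptr )
            ++mapped;
    }
    return mapped;
}
```

With a budget equivalent to one 2048x2048 RGBA8 texture, four 1024x1024 regions fit and a fifth request fails, which is why callers should always check the result of each mapping.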
For streaming every frame (e.g. video playback, procedural animation from CPU), you should use two or three StagingTextures (double vs triple buffer) and cycle through them, one per frame (do not release these StagingTextures every frame, hold on to them instead).
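The cycling itself is simple bookkeeping. The sketch below is a generic, standalone ring; `FrameRing` is a hypothetical helper, and `T` merely stands in for whatever handle you hold on to (for instance a StagingTexture pointer).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Cycles through N long-lived resources, handing out one per frame.
// Hypothetical helper; T stands in for e.g. a StagingTexture pointer.
template <typename T, std::size_t N>
class FrameRing
{
    static_assert( N > 0u, "the ring needs at least one resource" );

public:
    explicit FrameRing( const std::array<T, N> &resources ) :
        mResources( resources ), mFrame( 0u )
    {
    }

    // Returns the resource to use this frame and advances the cycle.
    // The resources are never destroyed here, only reused.
    T &acquireForThisFrame()
    {
        T &resource = mResources[mFrame % N];
        ++mFrame;
        return resource;
    }

private:
    std::array<T, N> mResources;
    std::size_t mFrame;
};
```

With N = 3 the resource handed out at frame 0 is handed out again at frame 3, which gives the GPU two full frames to finish consuming it before the CPU writes to it again.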
For that we'll use AsyncTextureTickets. They're like StagingTextures, but in the opposite direction.
Downloading a single mip for a 2D texture from GPU is straightforward and this code should be enough. But a generic version that works on all types of textures has many small details, e.g. what if the TextureGpu uses MSAA? It needs to be resolved first.
It is for that reason that we advise using Image2::convertFromTexture, which handles all the small details.
Download streaming is very common in the case of video recording and web streaming.
After calling AsyncTextureTicket::download, calling AsyncTextureTicket::map to access the contents can stall until the GPU is done with the transfer. You can check if we're done by calling AsyncTextureTicket::queryIsTransferDone.
You should have two or three AsyncTextureTickets (double vs triple buffer): call download() at frame N, call map() at frame N+3, and recycle the tickets (rather than destroying them).
Additionally, since you'll be mapping 3 frames afterwards, you should call download( texture, mip, accurateTracking=false );
Using accurateTracking = false reduces tracking overhead, at the expense of more inaccurate queryIsTransferDone calls (which are unnecessary if you are waiting for 3 frames to download).
Warning: calling queryIsTransferDone too often when accurateTracking = false will cause Ogre to switch to accurate tracking, which causes unnecessary overhead and can hurt your performance a lot. This is because Ogre will assume your code could be stuck in an infinite loop, e.g. one that keeps polling queryIsTransferDone until it returns true, and such a loop would never exit unless Ogre switches that ticket to accurate tracking. A warning will be logged in the Ogre.log if this happens.
Accurate tracking isn't slow, but switching from inaccurate to accurate is potentially very slow, depending on many circumstances; in the worst case scenario it can produce a full stall (CPU waits until GPU has fully finished doing all pending jobs).
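The "download at frame N, map at frame N+3" pattern can be sketched without any Ogre types. Everything below is hypothetical: `MockTicket` only records which frame it was issued on, and the comments mark where the real download() and map() calls would go.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for an AsyncTextureTicket: it only remembers which
// frame's contents it holds.
struct MockTicket
{
    bool hasData = false;
    std::uint64_t sourceFrame = 0u;
};

// Keeps kLatency tickets alive and recycles them. Each frame it first consumes
// the ticket issued kLatency frames ago, then reuses that ticket for a new
// request.
class ReadbackQueue
{
public:
    static constexpr std::size_t kLatency = 3u;

    // Returns true and writes the frame whose data became readable; returns
    // false while the pipeline is still filling up (the first kLatency frames).
    bool update( std::uint64_t currentFrame, std::uint64_t &outReadyFrame )
    {
        MockTicket &ticket = mTickets[currentFrame % kLatency];
        const bool ready = ticket.hasData;
        if( ready )
            outReadyFrame = ticket.sourceFrame;  // real code would map() here

        ticket.hasData = true;  // real code would download() here
        ticket.sourceFrame = currentFrame;
        return ready;
    }

private:
    std::array<MockTicket, kLatency> mTickets;
};
```

Because the data is only consumed three frames after it was requested, there is no need to poll for completion, which is the situation where inaccurate tracking is the cheaper choice.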
You may find that throughout Ogre we refer to depth, numSlices, and depthOrSlices. They're similar, but conceptually different.
Depth is associated with 3D volume textures. Depth is affected by mipmaps. Meaning that a 512x512x256 texture at mip 0 becomes 256x256x128 at mip 1. Current GPUs impose a maximum resolution of 2048x2048x2048 for 3D volume textures. Note however, depending on format, you may run out of GPU memory before reaching such high resolution.
NumSlices is associated with everything else: 1D Array & 2D Array textures, cubemap & cubemap array textures. A cubemap has a hardcoded numSlices of 6. A cubemap array's numSlices must be a multiple of 6. Slices are not affected by mipmaps, meaning that a 512x512x256 texture at mip 0 becomes 256x256x256 at mip 1.
depthOrSlices means that we're storing both depth and numSlices in the same variable. When this happens, usually extra information is needed (in other words, is this a 3D texture or not?). For example TextureGpu declares uint32 mDepthOrSlices because it also declares & knows the texture type TextureTypes::TextureTypes mTextureType.
If mipmapping isn't involved, then the texture type is not required, thus for the maths needed, depthOrSlices is the same as depth or numSlices.
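The difference between depth and slices under mipmapping can be written down in a few lines. This is a standalone sketch: `MockTextureType`, `MipExtent` and `getMipExtent` are hypothetical names, not Ogre's API, and only the 2D array and 3D cases are modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical enum for this sketch only; it is not Ogre's TextureTypes.
enum class MockTextureType
{
    Type2DArray,
    Type3D
};

struct MipExtent
{
    std::uint32_t width;
    std::uint32_t height;
    std::uint32_t depthOrSlices;
};

// Computes the extent of a given mip level. Width and height always halve
// (clamped to 1); the third dimension halves only for 3D volume textures.
inline MipExtent getMipExtent( std::uint32_t width, std::uint32_t height,
                               std::uint32_t depthOrSlices, MockTextureType type,
                               std::uint32_t mip )
{
    assert( mip < 32u );  // shifting a 32-bit value by 32 or more is undefined
    MipExtent extent;
    extent.width = std::max<std::uint32_t>( width >> mip, 1u );
    extent.height = std::max<std::uint32_t>( height >> mip, 1u );
    extent.depthOrSlices = ( type == MockTextureType::Type3D )
                               ? std::max<std::uint32_t>( depthOrSlices >> mip, 1u )
                               : depthOrSlices;
    return extent;
}
```

For a 512x512x256 texture at mip 1 this yields 256x256x128 when the type is 3D and 256x256x256 when it is a 2D array, matching the two examples in the text. At mip 0 the type makes no difference.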
AMD GCN Note: At the time of writing (2018/02/01), all known GCN cards add padding to round texture resolutions to the next power of 2 if they have mipmaps. This means a cubemap of 1024x1024x6 actually takes as much as 1024x1024x8. This is particularly important for 2D Array textures. Always try to use powers of 2 for your slices, otherwise you'll be wasting memory.
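To estimate how much such padding costs for a given slice count, a rough standalone sketch follows. It assumes the padding behaviour is exactly "round up to the next power of 2" as described in the note; `roundUpToPowerOfTwo` and `wastedSlices` are hypothetical helpers, and real drivers may behave differently.

```cpp
#include <cassert>
#include <cstdint>

// Rounds value up to the next power of two (returns value itself if it
// already is one).
inline std::uint32_t roundUpToPowerOfTwo( std::uint32_t value )
{
    assert( value > 0u && value <= 0x80000000u );
    std::uint32_t result = 1u;
    while( result < value )
        result <<= 1u;
    return result;
}

// Number of slices that would be allocated but never used, under the
// assumption that the slice count is padded to the next power of two.
inline std::uint32_t wastedSlices( std::uint32_t numSlices )
{
    return roundUpToPowerOfTwo( numSlices ) - numSlices;
}
```

Under that assumption a 6-slice cubemap is padded to 8 slices (2 wasted), while an 8-slice array wastes nothing.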
All data manipulated by Ogre (e.g. TextureBox, Image2, StagingTextures and AsyncTextureTickets) have the following memory layouts and rules:
Internal layout of data in memory:
The layout for 3D volume and array textures is the same. The only thing that changes is that for 3D volumes, the depth also decreases with each mip, while for array textures it is kept constant.
For 1D array textures, the number of slices is stored in mDepthOrSlices, not in Height.
For code reference, look at _getSysRamCopyAsBox implementation, and TextureBox::at.
Each row of pixels is aligned to 4 bytes (except for compressed formats that require more strict alignments, such as alignment to the block).
You can calculate bytesPerRow by doing:
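As a standalone illustration of the 4-byte row alignment rule for uncompressed formats (not Ogre's own helper), the arithmetic can be sketched as follows. `alignToNextMultiple` and `bytesPerRowUncompressed` are hypothetical names, and block-compressed formats are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rounds offset up to the next multiple of alignment.
inline std::size_t alignToNextMultiple( std::size_t offset, std::size_t alignment )
{
    assert( alignment > 0u );
    return ( ( offset + alignment - 1u ) / alignment ) * alignment;
}

// Bytes occupied by one row of an uncompressed texture, with rows aligned to
// 4 bytes. Block-compressed formats need different handling and are not
// covered by this sketch.
inline std::size_t bytesPerRowUncompressed( std::uint32_t width,
                                            std::size_t bytesPerPixel )
{
    return alignToNextMultiple( std::size_t( width ) * bytesPerPixel, 4u );
}
```

For example, a 5-pixel-wide single-byte-per-pixel row occupies 8 bytes rather than 5, so code that walks rows must advance by the aligned row size and not by `width * bytesPerPixel`.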
Ogre 2.2 code is new. While a lot of bugs have been ironed out already, the streaming code may still contain a few hidden bugs.
In particular, most of these bugs only surface if textures are loaded in a particular order (because their resolution or pixel formats affect our algorithms in a way that makes them misbehave).
This is problematic with threading because results become non-deterministic: on the first run texture A was uploaded and then B was uploaded, but on the second run textures A and B were both uploaded as part of the same batch. Likewise, different computers take different amounts of time to upload (because their CPUs and drives are slower/faster), unearthing bugs that didn't reproduce on your machine.
To troubleshoot these annoying bugs, you can go to OgreMain/src/OgreTextureGpuManager.cpp and uncomment the following macro:
This will force Ogre to stream using the main thread, thus behaving deterministically in all machines.
Uncommenting the following macro may help finding out what's going on:
If the problem does not reproduce at all when OGRE_FORCE_TEXTURE_STREAMING_ON_MAIN_THREAD is defined, then play with the following parameters until you nail the problem:
And of course, report your problem and findings in https://forums.ogre3d.org
So far we've covered how to use regular textures. But we left out another big one: Render Textures. Well, that's easy: just create a texture with TextureFlags::RenderTexture flag:
But... how do we render to it?
Compositors hide most of the complexity from you, and you will likely not be dealing with RenderPassDescriptors directly.
However if you're still reading here, chances are you're developing something advanced that requires low level rendering, like developing your own Hlms or rendering a third party GUI library.
RenderPassDescriptors describe self contained units of work to submit to GPUs, basically to keep mobile TBDRs happy as described in a previous section.
Setting up a basic RenderPassDescriptor is straightforward:
A few notes:
Note RenderPassDescriptors have more parameters. For example, you can set which mipmap you want to render to, or which slice in an array. You can also render to a 1024x1024 MSAA texture and resolve the result into the mip 1 of a 2048x2048 texture (mip 1 is 1024x1024).
For further code reference on setting up RenderPassDescriptors you should look into CompositorPass::setupRenderPassDesc.
Note: This section is only relevant to those writing their own Hlms implementations.
Ogre 2.2 uses a different binding model to make compatibility in the future with Vulkan and D3D12 easier.
Rather than binding one texture at a time into a huge table, these APIs work with the concept of "descriptor sets". We could say in very layman's terms, that these are just an array of textures, and every frame we bind the list instead.
Descriptor sets are managed in a similar way to how HlmsMacroblocks and HlmsBlendblocks are created and destroyed:
hazardousTexIdx is a special index for textures that are potential hazards, such as when a particular texture in the descriptor set could also be currently bound as a RenderTarget (which is illegal / undefined behavior). hazardousTexIdx is in range [0; mTextureDescSet.mTextures.size()). If the value is outside that range, we assume there are no potentially hazardous textures inside the descriptor set. This value is only used by D3D11 & Metal.
Creating a DescriptorSetTexture is this easy, and the same applies for DescriptorSetSampler and DescriptorSetUav. The difference between DescriptorSetTexture and DescriptorSetTexture2 is that the latter allows doing more advanced stuff (such as reinterpreting the texture using a different pixel format).
But if you take a look at OGRE_HLMS_TEXTURE_BASE_CLASS::bakeTextures in Components/Hlms/Common/include/OgreHlmsTextureBaseClass.inl you'll notice that the routine that generates the descriptor texture & sampler sets is overly complex. Why is that?
The complexity of bakeTextures comes from two parts:
First, we try to take advantage of D3D11 and Metal. In OpenGL, textures and samplers are tied together. This means that the texture at slot 5 must use the sampler at slot 5. The main drawback of this approach is that OpenGL is often limited to 16 textures per shader stage (around 30 textures per shader stage for more modern cards and drivers). D3D11 and Metal split the two, meaning that the texture at slot 5 can use the sampler at slot 1. This allows D3D11 and Metal to bind up to 128 textures and 16 samplers at the same time. That's a lot more than what OpenGL can do.
Because being limited to 16 textures has been a popular complaint about Ogre 2.1, and dropping OpenGL is not an option, bakeTextures handles both paths. The code would be simpler if we just assumed one of these paths.
The second reason is descriptor reuse: Material A may use Textures X, Y & Z, while Material B may use Textures Z, Y & X (same textures, different order). This different order would cause two sets to be generated. However, we sort the textures so they are deterministically ordered; therefore only one set is generated and it can be shared between both materials.
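The reuse idea can be shown with a tiny standalone cache. This is only a sketch of the general technique (canonicalise the order, then look the set up by value); `TextureId` and `DescriptorSetCache` are hypothetical and unrelated to how bakeTextures is actually written.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using TextureId = std::uint32_t;  // hypothetical texture handle

// Shares one "descriptor set" (represented here by an index) between all
// materials that reference the same textures, regardless of order.
class DescriptorSetCache
{
public:
    // Takes the list by value so it can be sorted without touching the caller's
    // copy. Returns the index of the existing or newly created set.
    std::size_t getOrCreate( std::vector<TextureId> textures )
    {
        std::sort( textures.begin(), textures.end() );  // deterministic order

        const auto it = mSets.find( textures );
        if( it != mSets.end() )
            return it->second;

        const std::size_t newIndex = mSets.size();
        mSets.emplace( std::move( textures ), newIndex );
        return newIndex;
    }

    std::size_t getNumSets() const { return mSets.size(); }

private:
    std::map<std::vector<TextureId>, std::size_t> mSets;
};
```

Requesting a set for textures {1, 2, 3} and then for {3, 2, 1} returns the same index and leaves a single set in the cache. Note that sorting also means the material has to remember where each of its textures ended up inside the shared set.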
Additionally, please note that descriptor sets need to be invalidated when a texture changes residency, which is why we listen for such changes via notifyTextureChanged.
Yes. Ogre 2.2 got rid of anything that used the "old" Textures. That includes the HlmsTextureManager.
The new TextureGpuManager, which replaces the old TextureManager, also replaces HlmsTextureManager.
The functionality that was provided by HlmsTextureManager (pretending "a texture" was just one texture when behind the scenes it's actually a slice from a Texture2D Array) became a first class citizen in 2.2:
When TextureFlags::AutomaticBatching is present, the TextureGpu will assume this is a TextureTypes::Type2D texture that behind the scenes is actually a slice of a shared TextureTypes::Texture2DArray texture.
The following routines are relevant when dealing with AutomaticBatching textures:
Most functions completely hide the fact that you're dealing with an array for you.
For example getTextureType() will return Type2D (which is actually a lie) and TextureGpu::copyTo will fail if the dstBox parameter contains z or sliceStart > 0, because Ogre will internally add the internal slice start offset to whatever you ask.
Same will happen with StagingTextures and AsyncTextureTickets.
If you want to get the real thing, you need to grab TexturePool::masterTexture (which is a TextureGpu) and TextureGpu::getInternalSliceStart to find the slice.
A lot of functionality from TextureGpuManager will look familiar, as it came from HlmsTextureManager. For example:
Just like with the old HlmsTextureManager, you can specify the resource filename where the texture should be loaded from (via name and resourceGroup) while also specifying an "alias" to reference it with another name.
This allows you for example to load MyNormalMap.png two times, one as a RGBA8_UNORM texture to display its contents, and as a RG8_SNORM texture for use as a normal map, as long as you assign them two different alias names.
At the time of writing a few restrictions from the past still remain though: Unlit will accept any type of TextureGpu, while PBS will only accept TextureGpu that are Texture2DArray or TextureGpu with AutomaticBatching set (except for the reflections which must be TypeCube).
Since most TextureGpu textures are loaded as AutomaticBatching by default, this limitation on PBS should be less of an issue than it was on 2.1.
If you've done your own Hlms implementation (i.e. you're an advanced user), then there are a few changes you need to be aware of:
A lot of shared texture functionality has been moved out of HlmsUnlitDatablock and HlmsPbsDatablock into Components/Hlms/Common/include/OgreHlmsTextureBaseClass.h.
This header uses macros (not ideal, I know) to alter hardcoded maximum numbers of textures supported.
For example HlmsPbsDatablock derives from it by including the header, after first defining a few macros:
What OgreHlmsTextureBaseClass does is to keep track of which textures have been assigned to the material at each slot; and register listeners to these textures whenever the textures finish loading (or are unloaded) in order to alter the DescriptorSetTextures and DescriptorSetSamplers.
Furthermore it is in charge of making sure when using GL3+ that DescriptorSetTextures and DescriptorSetSamplers both match 1:1 (see hasSeparateSamplers), while when using D3D11 & Metal they're separated in order to improve performance and significantly raise the number of textures that can be bound per shader.
OgreHlmsTextureBaseClass also sorts DescriptorSetTextures according to specific criteria to allow sharing of descriptors between different materials (and better draw call sorting). For example, if Material X uses texture A then B and Material Y uses texture B then A, OgreHlmsTextureBaseClass sorts the descriptor so that A always comes before B, thus allowing reuse.
Because textures and samplers have been separated, diffuse_map0_idx indicates the index into the texture array, and the new property "diffuse_map0_sampler" indicates the index of the sampler to use.
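To make the two indices concrete, here is a standalone sketch of how texture slots and sampler slots could be laid out in each mode. It is an illustration under stated assumptions, not Ogre's bakeTextures: `BakedSlots` and `bakeSlots` are hypothetical, and sampler states are represented by plain integer ids.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BakedSlots
{
    std::vector<std::uint32_t> samplers;    // sampler state ids to bind, in slot order
    std::vector<std::size_t> samplerIndex;  // per texture: index into `samplers`
};

// samplerPerTexture[i] is the (hypothetical) sampler state id that texture i
// wants. With hasSeparateSamplers == false every texture gets its own sampler
// slot (1:1); otherwise identical samplers are shared.
inline BakedSlots bakeSlots( const std::vector<std::uint32_t> &samplerPerTexture,
                             bool hasSeparateSamplers )
{
    BakedSlots baked;
    for( std::size_t i = 0u; i < samplerPerTexture.size(); ++i )
    {
        const std::uint32_t wanted = samplerPerTexture[i];
        if( !hasSeparateSamplers )
        {
            // Tied model: sampler slot i always belongs to texture slot i.
            baked.samplers.push_back( wanted );
            baked.samplerIndex.push_back( i );
        }
        else
        {
            // Separate model: reuse an existing sampler slot when possible.
            std::vector<std::uint32_t>::iterator it =
                std::find( baked.samplers.begin(), baked.samplers.end(), wanted );
            if( it == baked.samplers.end() )
            {
                baked.samplers.push_back( wanted );
                it = baked.samplers.end() - 1;  // re-fetch: push_back invalidated it
            }
            baked.samplerIndex.push_back(
                static_cast<std::size_t>( it - baked.samplers.begin() ) );
        }
    }
    return baked;
}
```

With four textures that want samplers {7, 7, 9, 7}, the tied layout binds four samplers, whereas the separate layout binds only two and the last texture refers to sampler slot 0. That per-texture sampler slot is the kind of value a property such as diffuse_map0_sampler communicates to the shader.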
Mipmaps
In 2.1, getNumMipmaps() = 0 means having just 1 mip (mip 0); only the extra mips are counted.
In 2.2, getNumMipmaps() cannot return 0, because mip 0 is counted as well.
This can cause off by 1 errors. For example old code:
Must now become:
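A minimal standalone sketch of the two loop shapes, assuming the old code iterated with `<=` over the 2.1 value. `countVisitedMips21` and `countVisitedMips22` are hypothetical helpers that only count iterations; the parameters stand in for what getNumMipmaps() returned in each version.

```cpp
#include <cassert>
#include <cstdint>

// 2.1 convention: the value counts only the EXTRA mips, so 0 means "mip 0 only"
// and the loop condition has to be inclusive.
inline std::uint32_t countVisitedMips21( std::uint32_t numMipmaps21 )
{
    std::uint32_t visited = 0u;
    for( std::uint32_t mip = 0u; mip <= numMipmaps21; ++mip )
        ++visited;
    return visited;
}

// 2.2 convention: the value counts ALL mips including mip 0, so it is never 0
// and the loop condition becomes exclusive.
inline std::uint32_t countVisitedMips22( std::uint32_t numMipmaps22 )
{
    assert( numMipmaps22 > 0u );
    std::uint32_t visited = 0u;
    for( std::uint32_t mip = 0u; mip < numMipmaps22; ++mip )
        ++visited;
    return visited;
}
```

A 256x256 texture with a full mip chain has 9 levels: under the 2.1 convention the count was 8, under 2.2 it is 9, and both loops visit the same 9 levels. Keeping the inclusive `<=` with the 2.2 value would touch a tenth, non-existent mip.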
TexturePtr is default initialized to 0. A TextureGpu pointer is not.
This causes common problems with arrays of textures in C++, e.g. TexturePtr myTextures[5];
If you do not initialize your TextureGpu variable(s), they will contain uninitialized values, whereas in your old code they would be initialized to 0.
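One way to keep the old "starts out null" behaviour with raw pointers is value-initialization. The sketch below is standalone; `MockTextureGpu` is a hypothetical empty type standing in for the real class, and the array size of 5 simply mirrors the example above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct MockTextureGpu  // hypothetical stand-in for the real texture class
{
};

typedef std::array<MockTextureGpu *, 5> TextureSlots;

// The empty braces value-initialize the array, which sets every raw pointer to
// nullptr. Writing `TextureSlots slots;` without the braces would leave the
// pointers with indeterminate values.
inline TextureSlots makeEmptyTextureSlots()
{
    TextureSlots slots{};
    return slots;
}

// Counts how many slots currently point at a texture.
inline std::size_t countAssigned( const TextureSlots &slots )
{
    std::size_t assigned = 0u;
    for( std::size_t i = 0u; i < slots.size(); ++i )
    {
        if( slots[i] != nullptr )
            ++assigned;
    }
    return assigned;
}
```

The same applies to plain C arrays and class members: `MockTextureGpu *textures[5] = {};` or a member initializer list entry achieves the same result.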
Compositor textures are non-MSAA by default. Use msaa <number of samples> to explicitly set MSAA on a compositor texture, or msaa_auto to use the same MSAA setting as the final output of the workspace. The previous behavior was as if msaa_auto was present in all textures unless it was explicitly turned off by you.
RTVs (Render Target Views) in Compositors are required; they're not implicit. In Ogre 2.1 this was enough to render to a texture when setting up the compositor from C++:
The equivalent code for 2.2 would be the following:
However this is not enough. We do not render to tmpCubemap. We render to an RTV, which references tmpCubemap. We'll now create this RTV using the convenience function RenderTargetViewDef::setForTextureDefinition, which does all the work for us:
Please note a few things:
TextureDefinition contains a few variables such as depthBufferId, depthBufferFormat and preferDepthTexture which are then repeated in RenderTargetViewDef. The settings that matter are the ones in RenderTargetViewDef, and they are often (but not necessarily) just a carbon copy. So why does TextureDefinition have duplicates? Because when textures are used as output and input for inter-connecting nodes, the RTV settings are lost and the Compositor needs to evaluate at connection time what depth buffer settings the texture prefers. For more information see the CompositorPass::setupRenderPassDesc snippet that starts with if( rtv->isRuntimeAnalyzed() ). Also see the Ogre::RenderTargetViewDef::setRuntimeAnalyzed documentation.
D3D11 specifics:
You could use getCustomAttribute to retrieve several D3D11 internal pointers. These have changed: