Animated Crowd Rendering

About

Motivation

I wanted to learn techniques that lighten the load on the CPU and make better use of the GPU, and I had recently been working a lot with animations. So when I came across animated crowd rendering, I felt that it was a good project for me and one I would be able to complete. I integrated the implementation into our school's rendering framework, "TGE".

I was inspired by a GDC talk about Ghost of Tsushima and how the team used cheap GPU-based animations to fill up scenes and create living environments. I like how relatively easy it is to populate scenes with insects, animals and birds, using simple behaviours that have a huge impact on how alive a world feels.

Inspiration:

https://youtu.be/d61_o4CGQd8?t=509

Resources:

https://youtu.be/EUTE1SoOGrk?t=1139
https://developer.nvidia.com/gpugems/gpugems3/part-i-geometry/chapter-2-animated-crowd-rendering
Assets are from The Game Assembly's Spite resource bank

Implementation

Animation Texture

When the scene starts, the relevant animations are converted into a 2D texture, one per model. The first row contains a texture header followed by one header per animation, each holding the start row, number of frames, duration and padding (for optimisation). The remaining rows are the tightly packed animation frames, one frame per row, and the columns hold the 3x4 matrix of each bone.
The texture is then sent to the GPU, and its shader resource is stored together with the other related components, such as the shaders and the instance buffer.
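The layout described above could be built on the CPU roughly as follows. This is only a sketch under assumptions: all type and function names (Texel, Clip, BuildAnimationTexture and so on) are hypothetical and not taken from TGE, the texture header contents are a guess, and the texel format is assumed to be four 32-bit floats so that each header fits in exactly one texel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical types for illustration; none of these names come from TGE.
struct Texel { float r, g, b, a; };      // one 128-bit texel
struct BoneMatrix { Texel rows[3]; };    // a 3x4 matrix occupies 3 texels

struct TextureHeader {                   // assumed content of texel (0, 0)
    uint32_t animationCount;
    uint32_t boneCount;
    uint32_t padding[2];
};

struct AnimationHeader {                 // texels (1..N, 0), one per animation
    uint32_t startRow;
    uint32_t frameCount;
    float    duration;                   // seconds
    uint32_t padding;                    // pads the header to one full texel
};
static_assert(sizeof(TextureHeader) == sizeof(Texel), "header must be one texel");
static_assert(sizeof(AnimationHeader) == sizeof(Texel), "header must be one texel");

struct Clip {
    float duration;
    std::vector<std::vector<BoneMatrix>> frames;  // frames[frame][bone]
};

struct AnimationTexture {
    uint32_t width = 0;
    uint32_t height = 0;
    std::vector<Texel> texels;           // row-major, width * height texels
};

AnimationTexture BuildAnimationTexture(const std::vector<Clip>& clips, uint32_t boneCount)
{
    uint32_t totalFrames = 0;
    for (const Clip& clip : clips)
        totalFrames += static_cast<uint32_t>(clip.frames.size());

    AnimationTexture tex;
    // Wide enough for every bone (3 texels each) and for all headers in row 0.
    tex.width  = std::max<uint32_t>(boneCount * 3u, 1u + static_cast<uint32_t>(clips.size()));
    tex.height = 1u + totalFrames;       // row 0 is reserved for headers
    tex.texels.assign(static_cast<std::size_t>(tex.width) * tex.height, Texel{});

    const TextureHeader textureHeader{static_cast<uint32_t>(clips.size()), boneCount, {0, 0}};
    std::memcpy(&tex.texels[0], &textureHeader, sizeof(Texel));

    uint32_t row = 1;
    for (std::size_t i = 0; i < clips.size(); ++i) {
        const Clip& clip = clips[i];
        const AnimationHeader header{row, static_cast<uint32_t>(clip.frames.size()), clip.duration, 0};
        std::memcpy(&tex.texels[1 + i], &header, sizeof(Texel));

        // One row per frame, bones packed left to right.
        for (const std::vector<BoneMatrix>& frame : clip.frames) {
            assert(frame.size() == boneCount);
            std::memcpy(&tex.texels[static_cast<std::size_t>(row) * tex.width],
                        frame.data(), frame.size() * sizeof(BoneMatrix));
            ++row;
        }
    }
    return tex;
}
```

The integer header fields are copied bit for bit into a float texture here, so a shader reading them would have to reinterpret the bits (for example with asuint in HLSL) rather than convert the values.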

Instancing

The world-space 3x4 matrix and the uint4 animation data of every model are stored in an instance buffer after a per-instance culling pass. The animation data contains which animation to play, the animation speed, padding and the start time. A negative start time means that the animation is looping.
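A possible CPU-side shape of that instance data, as a sketch only. The names (InstanceData, Actor, BuildInstanceBuffer) are hypothetical, the culling test is left as a caller-supplied predicate, and I am assuming that speed and start time are floats whose bits are stored in the uint4, since a start time can be negative.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical per-instance layout: 3x4 world matrix plus one uint4.
struct InstanceData {
    float    world[12];        // row-major 3x4 world matrix
    uint32_t animationIndex;   // which animation to play
    uint32_t speedBits;        // float bits of the animation speed (assumption)
    uint32_t padding;
    uint32_t startTimeBits;    // float bits of the start time; negative means looping
};
static_assert(sizeof(InstanceData) == 64, "48 bytes of matrix + 16 bytes of animation data");

inline uint32_t FloatToBits(float value)
{
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

inline float BitsToFloat(uint32_t bits)
{
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Hypothetical game-side representation of one crowd member.
struct Actor {
    float    world[12];
    uint32_t animationIndex;
    float    speed;
    float    startTime;
};

// Runs the per-instance culling test and packs the survivors for upload.
template <typename IsVisible>
std::vector<InstanceData> BuildInstanceBuffer(const std::vector<Actor>& actors, IsVisible isVisible)
{
    std::vector<InstanceData> instances;
    instances.reserve(actors.size());
    for (const Actor& actor : actors) {
        if (!isVisible(actor))
            continue;                      // culled instances never reach the GPU
        InstanceData data{};
        std::memcpy(data.world, actor.world, sizeof(data.world));
        data.animationIndex = actor.animationIndex;
        data.speedBits      = FloatToBits(actor.speed);
        data.startTimeBits  = FloatToBits(actor.startTime);
        instances.push_back(data);
    }
    return instances;
}
```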

Vertex shader

In the shader, the headers are loaded and combined with the instance's animation data to calculate the current frame, from which the skinning matrix is built. The rest of the shader is the same as for normal skeletal animation.
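The frame calculation could look something like this, written in C++ so it can be run on the CPU (the real thing lives in the vertex shader). It is a sketch under assumptions: ResolveFrameRow is a hypothetical name, looping animations are assumed to run off the global time, and one-shot animations are assumed to hold their last frame once they finish.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Mirrors the per-animation header stored in the first texture row.
struct AnimationHeader {
    uint32_t startRow;
    uint32_t frameCount;
    float    duration;   // seconds
    uint32_t padding;
};

// Returns the texture row holding the frame to sample at time `now`.
uint32_t ResolveFrameRow(const AnimationHeader& header, float speed, float startTime, float now)
{
    assert(header.frameCount > 0 && header.duration > 0.0f);
    const float framesPerSecond = static_cast<float>(header.frameCount) / header.duration;

    const bool looping = startTime < 0.0f;            // negative start time means looping
    const float rawElapsed = looping ? now * speed : (now - startTime) * speed;
    const float elapsed = std::max(0.0f, rawElapsed); // not started yet: stay on frame 0

    const uint32_t frame = static_cast<uint32_t>(elapsed * framesPerSecond);
    const uint32_t localFrame = looping
        ? frame % header.frameCount                          // wrap around
        : std::min<uint32_t>(frame, header.frameCount - 1u); // assumed: hold last frame
    return header.startRow + localFrame;
}
```

With a 10-frame, 1-second animation starting at row 4, a looping instance at time 2.35 lands on row 7, and a one-shot instance started at time 1.0 is clamped to row 13 by time 5.0.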

Optimizations

Frame interpolation

R32G32B32A32 Texture

In my first iteration I used an R32 texture. When I changed it to an R32G32B32A32 texture, there was a huge performance increase, because GPUs are often optimized to fetch 128 bits of data at a time. I also tried using a structured buffer and fetching 3x4 floats at a time, but there wasn't any noticeable performance gain, and my texture implementation was already done, so I scrapped that to save time.
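To make the difference concrete, here is a small CPU-side sketch that counts the fetches needed to read one bone matrix from each layout. The function names are hypothetical and the fetch counter only illustrates the number of texture loads; it says nothing about actual GPU timings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Float4 { float x, y, z, w; };   // one 128-bit texel
struct Matrix3x4 { float m[12]; };

// Single-channel layout: every float is its own texel, so 12 fetches per bone.
Matrix3x4 LoadBoneR32(const std::vector<float>& texels, uint32_t bonesPerRow,
                      uint32_t row, uint32_t bone, uint32_t& fetches)
{
    Matrix3x4 result{};
    const std::size_t base = (static_cast<std::size_t>(row) * bonesPerRow + bone) * 12;
    for (uint32_t i = 0; i < 12; ++i) {
        result.m[i] = texels[base + i];
        ++fetches;
    }
    return result;
}

// Four-channel layout: one texel holds a whole matrix row, so 3 fetches per bone.
Matrix3x4 LoadBoneRGBA32(const std::vector<Float4>& texels, uint32_t bonesPerRow,
                         uint32_t row, uint32_t bone, uint32_t& fetches)
{
    Matrix3x4 result{};
    const std::size_t base = (static_cast<std::size_t>(row) * bonesPerRow + bone) * 3;
    for (uint32_t i = 0; i < 3; ++i) {
        const Float4 texel = texels[base + i];
        ++fetches;
        result.m[i * 4 + 0] = texel.x;
        result.m[i * 4 + 1] = texel.y;
        result.m[i * 4 + 2] = texel.z;
        result.m[i * 4 + 3] = texel.w;
    }
    return result;
}
```

Both functions return the same matrix for the same data; the four-channel version just gets there with a quarter of the loads.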

Scalable FPS

The animations had worse quality than the CPU-based animations, because there is no interpolation between frames. I wanted to keep the shader lightweight, because my intention with this implementation was to populate scenes cheaply. Then it hit me that I could interpolate the frames when the textures are created, which costs no runtime performance, only memory. I also multiplied the bone matrices with the inverse bind pose when creating the textures instead of in the shader, because the vertices are stored in the bind pose.
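Resampling a clip to a higher frame rate at bake time could be sketched like this. The names (BonePose, ResampleClip) are hypothetical, and I am assuming the source keyframes are available as a translation plus a unit quaternion per bone, blended with a normalized lerp. Converting the poses to matrices and multiplying with the inverse bind pose would follow as a separate step and is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical local pose of one bone: translation and unit quaternion (x, y, z, w).
struct BonePose { float t[3]; float q[4]; };
using Frame = std::vector<BonePose>;

// Linear blend of translation, normalized lerp of rotation along the shorter arc.
BonePose Blend(const BonePose& a, const BonePose& b, float s)
{
    BonePose out{};
    for (int i = 0; i < 3; ++i)
        out.t[i] = a.t[i] + (b.t[i] - a.t[i]) * s;

    float dot = 0.0f;
    for (int i = 0; i < 4; ++i)
        dot += a.q[i] * b.q[i];
    const float sign = dot < 0.0f ? -1.0f : 1.0f;   // flip b to take the shorter arc

    float lengthSquared = 0.0f;
    for (int i = 0; i < 4; ++i) {
        out.q[i] = a.q[i] + (sign * b.q[i] - a.q[i]) * s;
        lengthSquared += out.q[i] * out.q[i];
    }
    const float inverseLength = 1.0f / std::sqrt(lengthSquared);
    for (int i = 0; i < 4; ++i)
        out.q[i] *= inverseLength;
    return out;
}

// Produces round(duration * targetFps) evenly spaced frames from the source keyframes.
std::vector<Frame> ResampleClip(const std::vector<Frame>& source, float duration, float targetFps)
{
    assert(!source.empty() && duration > 0.0f && targetFps > 0.0f);
    const std::size_t outCount =
        std::max<std::size_t>(1, static_cast<std::size_t>(std::lround(duration * targetFps)));

    std::vector<Frame> out(outCount);
    for (std::size_t f = 0; f < outCount; ++f) {
        // Fractional position of this output frame among the source keyframes.
        const float position = outCount > 1
            ? static_cast<float>(f) * static_cast<float>(source.size() - 1)
                  / static_cast<float>(outCount - 1)
            : 0.0f;
        const std::size_t k0 = static_cast<std::size_t>(position);
        const std::size_t k1 = std::min(k0 + 1, source.size() - 1);
        const float s = position - static_cast<float>(k0);

        out[f].resize(source[k0].size());
        for (std::size_t bone = 0; bone < source[k0].size(); ++bone)
            out[f][bone] = Blend(source[k0][bone], source[k1][bone], s);
    }
    return out;
}
```

Because the extra frames are baked into the texture, the shader still does a single row lookup per bone; the cost shows up only as more texture rows.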

Results and Improvements

FPS Comparisons

In a scene with 768 animations playing across 6 different models, some of them pretty complex, and a simple behaviour, I get about 160 fps using instanced crowd animation (GPU based) and about 80 fps doing it normally on the CPU (CPU based). So it's about a 100% performance increase.
The biggest gains show up when running in debug. There it's 45 fps (GPU based) vs 6 fps (CPU based) in the same scene (768 animations), which is 7.5 times as fast, or a 650% performance increase. For comparison, I can play 2944 animations at the same time in debug (GPU based) at about 20 fps.

Suggestions for Improvements

LOD-ing the models would greatly increase performance, because performance is directly related to the number of vertices.
Group culling would lower the complexity of the game loops and, I believe, give a good overall fps increase.
Utilise the textures better. In my implementation there's only one column of animations. This gives me about 4.5 min of animation time at 60 fps with up to 1365 bones, which was more than enough for me. But if you pack several columns of animations side by side, you can get about 1 hour of animation time at 60 fps with 100 bones.
Using a structured buffer might be preferred in DirectX 11/12, but I don't think it gives any big improvement other than in memory use.
Better and more varied behaviours would make the scenes more diverse, but it's important to keep them lightweight. The behaviours I made were pretty rushed.
Save the textures to disk for faster load times. Even though the load times were fast in my opinion, this could be a good thing to implement anyway.
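The texture utilisation figures in the suggestions above can be reproduced with a little arithmetic. This sketch assumes a maximum texture height of 16384 rows (the Direct3D 11 limit for 2D textures) and takes the 1365 bone slots per row from the text. It ignores the single header row, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Seconds of animation that fit in one texture when clips are packed in columns.
double SecondsOfAnimation(uint32_t maxRows, uint32_t boneSlotsPerRow,
                          uint32_t bonesPerModel, double framesPerSecond)
{
    assert(bonesPerModel > 0 && bonesPerModel <= boneSlotsPerRow && framesPerSecond > 0.0);
    const uint32_t columns = boneSlotsPerRow / bonesPerModel;  // side-by-side animation columns
    return columns * (maxRows / framesPerSecond);
}

// One column using the full width:   SecondsOfAnimation(16384, 1365, 1365, 60.0) / 60 is about 4.55 min
// 100-bone models, 13 columns:       SecondsOfAnimation(16384, 1365, 100, 60.0) / 60 is about 59.2 min
```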