Finally, I managed to finish the adventure with my particle system! This time I’d like to share some thoughts about improvements in the OpenGL renderer.

The code was simplified and I got a small performance improvement.

The most recent repo: particles/renderer_opt @github

Where are we?  

As I described in the post about my current renderer, I use quite a simple approach: copy position and color data into the VBO buffer and then render particles.

Here is the core code of the update proc:

glBindBuffer(GL_ARRAY_BUFFER, m_bufPos);
ptr = m_system->getPos(...);
glBufferSubData(GL_ARRAY_BUFFER, 0, size, ptr);

glBindBuffer(GL_ARRAY_BUFFER, m_bufCol);
ptr = m_system->getCol(...);
glBufferSubData(GL_ARRAY_BUFFER, 0, size, ptr);

The main problem with this approach is that we need to transfer data from system memory to the GPU. The GPU needs to read that data, whether it is explicitly copied into GPU memory or read directly through GART, and only then can it use it in a draw call.

It would be much better to keep everything on the GPU side, but that is too complicated at this point. Maybe in the next version of my particle system I’ll implement it completely on the GPU.

Still, we have some options to increase performance when doing CPU to GPU data transfer.

Basic Checklist  

  • Disable VSync! - OK
    • Quite easy to forget, but without this we could not measure real performance!
    • Small addition: do not use blocking code like timer queries too much. When done badly it can really spoil the performance! GPU will simply wait till you read a timer query!
  • Single draw call for all particles - OK
    • Doing one draw call per particle would obviously kill the performance!
  • Using Point Sprites - OK
    • An interesting test was done at geeks3D that showed that point sprites are faster than the geometry shader approach: even 30% faster on AMD cards, and between 5% and 33% faster on NVidia GPUs. Additional note on geometry shader from
    • Of course point sprites are less flexible (do not support rotations), but usually we can live without that.
  • Reduce size of the data - Partially
    • I send only pos and col, but I am using full FLOAT precision and 4 components per vector.
    • Risk: we could reduce vertex size, but that would require doing conversions. Is it worth it?

The numbers  

Memory transfer:

  • In total I use 8 floats per vertex/particle. If a particle system contains 100k particles (not that much!) we transfer 100k * 8 * 4 bytes = 3,200,000 bytes, or about 3 MB of data each frame.
  • If we want to use more particles, like 500k, it’ll be around 15MB each frame.
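
The arithmetic above can be checked with a small helper. This is only a sketch: the `Attribute` struct and the `bytesPerFrame` and `toMiB` functions are names made up for illustration and are not part of the particle system.

```cpp
#include <cassert>
#include <cstddef>

// Describes one vertex attribute: number of components and size of each.
// (Hypothetical helper type, not taken from the particle system code.)
struct Attribute {
    std::size_t elements;        // e.g. 4 for a vec4
    std::size_t bytesPerElement; // e.g. 4 for a 32-bit float
};

// Bytes that must be sent to the GPU each frame for `particles` particles
// that carry one position attribute and one color attribute.
inline std::size_t bytesPerFrame(Attribute pos, Attribute col,
                                 std::size_t particles) {
    assert(pos.elements > 0 && col.elements > 0);
    const std::size_t perParticle = pos.elements * pos.bytesPerElement
                                  + col.elements * col.bytesPerElement;
    return perParticle * particles;
}

// The same amount expressed in MiB (1024 * 1024 bytes).
inline double toMiB(std::size_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

For example, `bytesPerFrame({4, 4}, {4, 4}, 100000)` describes the current layout (two vec4 attributes of 32-bit floats), while `bytesPerFrame({3, 4}, {4, 1}, 100000)` describes a vec3 position with a 4-byte color, which halves the per-particle size.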

In my last CPU performance tests I got the following numbers: one frame of simulations for each effect (in milliseconds).

Now we need to add the GPU time + memory transfer cost.

Our Options  

As I described in detail in the posts about Persistent Mapped Buffers (PMB), I think it’s obvious we should use this approach.

Other options, like buffer orphaning or plain mapping, might work, but I think the code would be more complicated.

We can simply use PMB with a buffer three times the needed size (triple buffering), and that will probably give the best performance gain.

Here is the updated code:

The creation:

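// Persistent mapping needs GL_MAP_PERSISTENT_BIT in both flag sets;
// GL_MAP_COHERENT_BIT makes plain memcpy writes visible without explicit flushes.
const GLbitfield creationFlags = GL_MAP_WRITE_BIT |
        GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
const GLbitfield mapFlags = GL_MAP_WRITE_BIT |
        GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;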
const unsigned int BUFFERING_COUNT = 3;
const GLsizeiptr neededSize = sizeof(float) * 4 * 
        count * BUFFERING_COUNT;

glBufferStorage(GL_ARRAY_BUFFER, neededSize,
                nullptr, creationFlags);

mappedBufferPtr = glMapBufferRange(GL_ARRAY_BUFFER, 0, 
                  neededSize, mapFlags);

The update:

float *posPtr = m_system->getPos(...);
float *colPtr = m_system->getCol(...);
const size_t maxCount = m_system->numAllParticles();

// just a memcpy into the section that belongs to the current buffer;
// count is the number of particles to copy this frame
float *mem = m_mappedPosBuf + m_id*maxCount * 4;
memcpy(mem, posPtr, count*sizeof(float) * 4);
mem = m_mappedColBuf + m_id*maxCount * 4;
memcpy(mem, colPtr, count*sizeof(float) * 4);

// m_id - id of current buffer (0, 1, 2)
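
To make the indexing easier to follow, here is a small CPU-only sketch of the same idea. `MappedRing` is a hypothetical stand-in: it uses a `std::vector` where the real renderer would use the pointer returned by `glMapBufferRange`, so the offsets can be checked without an OpenGL context.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for one persistently mapped buffer split into equal sections.
// Plain CPU memory replaces the mapped pointer (assumption for illustration).
struct MappedRing {
    std::vector<float> storage; // sections * maxCount * 4 floats
    std::size_t maxCount;       // capacity of one section, in particles
    std::size_t sections;       // 3 for triple buffering
    std::size_t id = 0;         // section written this frame

    explicit MappedRing(std::size_t maxParticles, std::size_t sectionCount = 3)
        : storage(maxParticles * 4 * sectionCount, 0.0f),
          maxCount(maxParticles), sections(sectionCount) {}

    // Start of the current section, counted in floats (4 per particle).
    std::size_t offsetFloats() const { return id * maxCount * 4; }

    // Index of the first vertex of the current section; a value like this
    // could be passed as `first` to glDrawArrays.
    std::size_t firstVertex() const { return id * maxCount; }

    // Copy `count` vec4 values into the current section.
    void write(const float* src, std::size_t count) {
        assert(count <= maxCount);
        std::memcpy(storage.data() + offsetFloats(), src,
                    count * sizeof(float) * 4);
    }

    // Move to the next section once the draw call for this one is issued.
    void advance() { id = (id + 1) % sections; }
};
```

Each frame writes into a different third of the buffer, so the CPU does not touch the section that the two previous frames may still be reading from.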

My approach is quite simple and could be improved. Since I have a pointer to the memory I could pass it to the particle system. That way I would not have to memcpy it every time.

Another thing: I do not use explicit synchronization. This might cause some issues, but I haven’t observed that. Triple buffering should protect us from race conditions. Still, in real production code I would not be so optimistic :)
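
For reference, explicit synchronization usually means one fence per buffer section: wait on the fence before writing into a section, and insert a new fence after the draw call that reads it. The sketch below shows only that control flow. `SectionGuard` and `CountingFence` are hypothetical names, and the fence operations are abstracted behind a policy so the logic can be checked without a GL context. In real OpenGL code the three operations would map to `glFenceSync`, `glClientWaitSync` and `glDeleteSync`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// FencePolicy is an assumed interface with three operations:
//   Handle insert();        // OpenGL: glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0)
//   void   wait(Handle);    // OpenGL: loop on glClientWaitSync(...)
//   void   release(Handle); // OpenGL: glDeleteSync(...)
template <typename FencePolicy, std::size_t N = 3>
class SectionGuard {
public:
    explicit SectionGuard(FencePolicy& policy) : policy_(policy) {}

    // Call before the CPU writes into section `id`.
    void beforeWrite(std::size_t id) {
        assert(id < N);
        if (active_[id]) {
            policy_.wait(fences_[id]);    // GPU must be done reading it
            policy_.release(fences_[id]);
            active_[id] = false;
        }
    }

    // Call right after issuing the draw call that reads section `id`.
    void afterDraw(std::size_t id) {
        assert(id < N);
        if (active_[id]) policy_.release(fences_[id]);
        fences_[id] = policy_.insert();
        active_[id] = true;
    }

private:
    FencePolicy& policy_;
    std::array<typename FencePolicy::Handle, N> fences_{};
    std::array<bool, N> active_{};
};

// Fake policy that only counts calls (for testing the flow on the CPU).
struct CountingFence {
    using Handle = int;
    int inserted = 0, waited = 0, released = 0;
    Handle insert() { return ++inserted; }
    void wait(Handle) { ++waited; }
    void release(Handle) { ++released; }
};
```

With three sections the wait should almost never block, because the fence being waited on belongs to a frame issued three frames earlier.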


Initially (AMD HD 5500):


Reducing vertex size optimization  

I tried to reduce vertex size. I’ve even asked a question on StackOverflow:

How much perf can I get using half_floats for vertex attribs?

We could use GL_HALF_FLOAT or use vec3 instead of vec4 for position. And we could also use RGBA8 for color.

Still, after some basic tests, I did not get much performance improvement. Maybe that’s because I lost a lot of time doing the conversions.
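
To show the kind of conversion involved, here is a minimal sketch of packing a float color into RGBA8. `toUnorm8`, `ColorRGBA8` and `packColor` are illustrative names, not code from the repo, and the sketch assumes the color channels are meant to be in the [0, 1] range.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Convert one float channel to an 8-bit normalized value.
// Values outside [0, 1] are clamped first.
inline std::uint8_t toUnorm8(float v) {
    const float clamped = std::min(1.0f, std::max(0.0f, v));
    return static_cast<std::uint8_t>(std::lround(clamped * 255.0f));
}

// 4 bytes per color instead of 16 bytes for four 32-bit floats.
struct ColorRGBA8 {
    std::uint8_t r, g, b, a;
};

// Pack one RGBA color stored as 4 floats.
inline ColorRGBA8 packColor(const float rgba[4]) {
    return { toUnorm8(rgba[0]), toUnorm8(rgba[1]),
             toUnorm8(rgba[2]), toUnorm8(rgba[3]) };
}
```

On the OpenGL side such an attribute would be declared with `GL_UNSIGNED_BYTE` and `normalized` set to `GL_TRUE` in `glVertexAttribPointer`, so the shader still sees values in [0, 1]. The catch is that this loop runs on the CPU for every particle every frame, which can easily eat the time saved on the transfer.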

What’s Next  

The system with its renderer is not that slow. On my machine I get a decent 70 to 80 FPS for 0.5 million particles! For a 1 million particle system it drops to 30 to 45 FPS, which is also not that bad!

I would like to present some more ‘extraordinary’ data and say that I got a 200% perf improvement. Unfortunately it was not that easy… Definitely, the plan is to move to the GPU side for the next version. Hopefully there will be more room for improvement.

Read next: Summary