2

I am trying to create a physics engine for a custom game engine. At the moment everything works fine, however I am having some performance issues when the engine has to deal with approximately 4000 physics bodies. I am quite certain this is not the fault of the render engine, as it uses instanced rendering for particle effects (which I am currently testing) and can handle around 200K particles if they are all static.

So far, once all the collisions have been resolved, I update all of the physics bodies in the scene by applying a gravity force and translating the bodies by their velocity.

The function looks like this:

void mint::physics::PhysicsEngine::SymplecticEuler(mint::physics::PhysicsBody* body)
{
  mint::graphics::Entity *entity = body->GetEntity();

  // -- Symplectic Euler
  glm::vec2 gravity = glm::vec2(0.0f, (1.0f / core::Timer::Instance()->DeltaTime()) * 9.81f) * body->GravityScale();

  glm::vec2 dv = (body->Force() * body->GetMassData()->inv_mass + gravity * core::Timer::Instance()->DeltaTime());
  body->Velocity(body->Velocity() +  dv);

  glm::vec2 dxy = glm::vec2(body->Velocity() * core::Timer::Instance()->DeltaTime());
  entity->Translate(glm::vec3(dxy, 0.0f));
  // -- END -- Symplectic Euler

  // -- update the collider
  body->UpdateCollider();
  // -- END -- update the collider
}

This function runs once per physics body and is called in a for loop like so:

auto start = std::chrono::high_resolution_clock::now();
for (auto body : all_bodys)
{
    // -- sequential version
    //SymplecticEuler(body);

    // -- using std::async
    fEulerFutures.push_back(std::async(std::launch::async, SymplecticEuler, body));
}
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<float> duration = end - start;
std::cout << "physics update took: " << duration.count() << std::endl;

I am using std::chrono to see how long the update took. I have two different ways of implementing this: one just calls SymplecticEuler(body) directly, and the other uses std::async, with the future returned from the call stored in a member vector of the physics engine class which is cleared once every update.

Using the timing code above, the sequential loop took 0.00014s and the multithreaded loop took 0.005s. I would not expect the multithreaded loop to take longer than the sequential loop, but it did, so I am assuming I am either using std::async wrong or using it in the wrong context. The program I am running this on is a simple particle simulation with 300 particles, so nothing too big yet.

Can someone please let me know whether I am using std::async correctly (I am still very new to the concept of multithreading), whether I am spawning so many threads that it slows down the engine, or whether I should use compute shaders instead of multithreading? (If compute shaders would improve the performance of the engine, please leave some links to tutorials on how to use compute shaders in modern OpenGL with C++.)

Both of these functions are members of a physics engine class, and SymplecticEuler() is a static member function.

Thanks

  • Have you done `fEulerFutures.reserve(all_bodys.size())`? – Ted Lyngmo Apr 21 '20 at 16:26
  • If you are using gcc 9 (or later) or MSVC, you could try using a parallel execution policy: `std::for_each(std::execution::par, all_bodys.begin(), all_bodys.end(), [](auto body) { SymplecticEuler(body); });` – Ted Lyngmo Apr 21 '20 at 16:47
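
The std::for_each one-liner from that comment, expanded into a self-contained sketch: it needs C++17 and a standard library that actually implements the parallel execution policies (with GCC's libstdc++ that also means linking against TBB); all_bodys and SymplecticEuler are the names from the question.

#include <algorithm>
#include <execution>

// The library decides how to split the range across threads; each body is
// simply forwarded to the integrator.
std::for_each(std::execution::par, all_bodys.begin(), all_bodys.end(),
              [](auto body) { SymplecticEuler(body); });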

1 Answer

3

I would not expect the multithreaded loop to take longer than the sequential loop

I think that's your problem right there: why would you think it would take less? The amount of work to push tasks onto a concurrent data structure (which likely involves mutexes if written poorly, or at least cmpxchg instructions otherwise), then signal a kernel synchronization object (an event on Windows), and have a thread woken up by the kernel thread scheduler in response, which then has to access your data structure again in a thread-safe way to remove the task -- that's an insane amount of work.

Multithreading in general adds a lot more work for the CPU (and for library writers); the gain is that the work can happen on other threads, leaving your thread free to respond to GUI events instead of freezing. For that reason, you want the overhead to be orders of magnitude smaller than the amount of work you queue, and that is not the case for you -- all you have is a few SIMD instructions.

You might find an increase in speed if you group a few hundred or a few thousand of those updates per task, and if you don't have enough of them, just run them all as a single task.
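
Something along these lines might do it. This is only a sketch: the function name, the placeholder body type, and the batch size of 1024 are illustrative stand-ins rather than anything from your engine.

#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-ins; the real code would use mint::physics::PhysicsBody
// and the static PhysicsEngine::SymplecticEuler from the question.
struct PhysicsBody;
void SymplecticEuler(PhysicsBody* body);

// Integrate the bodies in chunks so the task-scheduling cost is paid a
// handful of times per frame instead of once per body.
void UpdateBodiesBatched(std::vector<PhysicsBody*>& bodies,
                         std::size_t batch_size = 1024)
{
    std::vector<std::future<void>> futures;
    futures.reserve(bodies.size() / batch_size + 1);

    for (std::size_t begin = 0; begin < bodies.size(); begin += batch_size)
    {
        const std::size_t end = std::min(begin + batch_size, bodies.size());
        futures.push_back(std::async(std::launch::async, [&bodies, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                SymplecticEuler(bodies[i]);
        }));
    }

    // Block until every batch has finished before collision detection runs again.
    for (auto& f : futures)
        f.get();
}

With only 300 bodies this collapses to a single task, which is exactly the point -- the work is too small to be worth spreading across threads.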

Blindy
  • The reason I wanted to use a multithreaded loop is so that it will loop through all the objects and update them in parallel, at the same time instead of one after another. I am also doing this to learn more about multithreading. – Ethan Hofton Apr 21 '20 at 16:32
  • I understand, but as I said, the overhead of queuing and dequeuing these things in a thread-safe way is higher than the amount of work you're queuing in the first place. Batch your work into larger chunks. – Blindy Apr 21 '20 at 16:49
  • Would it be better to do this using compute shaders or is that still too much work? – Ethan Hofton Apr 21 '20 at 16:49
  • Or too little work, you mean? Because really, a translation and a vector addition is literally nothing! – Blindy Apr 21 '20 at 17:40
  • With a position update it then has to update the model matrix of the object and then use the model matrix to recalculate the world position of the shape, and I cannot do this in the vertex shader as I am batch rendering. Perhaps I should look at moving the matrix calculation onto the GPU using a compute shader? – Ethan Hofton Apr 21 '20 at 17:43
  • I see your point now: the bottleneck was not in this function, and using multithreading here is just a waste of time. Any ideas on how to make the matrix and world-coordinate calculations any faster? – Ethan Hofton Apr 21 '20 at 17:57
  • SIMD is the answer, either on the CPU (which I believe `glm` should do if you enable SIMD at the compiler level) or on the GPU by passing the matrices as they are and multiplying in the shader (a rough sketch of the GPU route follows after this thread). – Blindy Apr 21 '20 at 18:15
  • I will have a look at SIMD, and about passing the matrices: if I have a lot of objects to render, won't it cause a big bottleneck if I am passing over multiple matrices for a large number of objects? – Ethan Hofton Apr 21 '20 at 18:18
  • It depends. Of course not sending the matrices at all is best, but if you have them, you need to calculate their multiplication somewhere. – Blindy Apr 21 '20 at 19:14
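
For reference, a rough sketch of the "pass the matrices and multiply in the shader" idea from this thread, using an instanced mat4 vertex attribute (which matches the instanced particle rendering mentioned in the question). The attribute locations 3..6, the buffer handle parameter, and GL_DYNAMIC_DRAW are illustrative choices, not something taken from the asker's engine.

#include <vector>
#include <glm/glm.hpp>
// plus whatever GL loader the engine already uses (glad, GLEW, ...)

// Upload one model matrix per instance; the vertex shader then computes
// gl_Position = viewProj * aModel * vec4(position, 1.0), so the per-object
// multiply happens on the GPU instead of the CPU.
void UploadModelMatrices(GLuint instanceVbo, const std::vector<glm::mat4>& models)
{
    glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
    glBufferData(GL_ARRAY_BUFFER, models.size() * sizeof(glm::mat4),
                 models.data(), GL_DYNAMIC_DRAW);

    // A mat4 attribute occupies four consecutive vec4 slots (locations 3..6 here).
    for (GLuint i = 0; i < 4; ++i)
    {
        glEnableVertexAttribArray(3 + i);
        glVertexAttribPointer(3 + i, 4, GL_FLOAT, GL_FALSE, sizeof(glm::mat4),
                              reinterpret_cast<void*>(sizeof(glm::vec4) * i));
        glVertexAttribDivisor(3 + i, 1); // advance once per instance, not per vertex
    }
}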