Optimize c++ Monte Carlo simulation with long dynamic arrays

Question

This is my first post here and I am not that experienced, so please excuse my ignorance.

I am building a Monte Carlo simulation in C++ for my PhD and I need help in optimizing its computational time and performance. I have a 3d cube repeated in each coordinate as a simulation volume and inside every cube magnetic particles are generated in clusters. Then, in the central cube a loop of protons are created and move and at each step calculate the total magnetic field from all the particles (among other things) that they feel.

At this moment I define everything inside the main function and because I need the position of the particles for my calculations (I calculate the distance between the particles during their placement and also during the proton movement), I store them in dynamic arrays. I haven't used any class or function,yet. This makes my simulations really slow because I have to use eventually millions of particles and thousands of protons. Even with hundreds it needs days. Also I use a lot of for and while loops and reading/writing to .dat files.

I really need your help. I have spent weeks trying to optimize my code and my project is behind schedule. Do you have any suggestion? I need the arrays to store the position of the particles .Do you think classes or functions would be more efficient? Any advice in general is helpful. Sorry if that was too long but I am desperate...

Ok, I edited my original post and I share my full script. I hope this will give you some insight regarding my simulation. Thank you.

Additionally I add the two input files

parametersDiffusion_spher_shel.txt

parametersIONP_spher_shel.txt

#include <stdio.h> 
#include <iostream>
#include <fstream>
#include <stdlib.h>
#include <math.h>
#include <iomanip> //precision to output
//#include <time.h>
#include <ctime>
#include <cstdlib>
#include <algorithm>
#include <string>
//#include <complex>
#include <chrono> //random generator
#include <random>

using namespace std;

#define PI 3.14159265
#define tN 500000 //# of timepoints (steps) to define the arrays ONLY
#define D_const 3.0E-9 //diffusion constant (m^2/s)
#define Beq 0.16 // Tesla
#define gI 2.6752218744E8 //(sT)^-1

int main(){

  //Mersenne Twister random engine
  mt19937 rng(chrono::steady_clock::now().time_since_epoch().count());
  //uniform_int_distribution<int> intDist(0,1);
  uniform_real_distribution<double> realDist(0.,1.);

  //for(int i=1; i<100; i++){
    //cout<<"R max: "<<Ragg-Rspm<<" r_spm: "<<(Ragg-Rspm)*sqrt(realDist(rng))<<endl;
  //}

  /////////////////////////////////////////////////////////////////////////////////////////////////////////
  //input files
  double Rionp=1.0E-8, Ragg=2.0E-7, t_tot=2.0E-2, l_tot = 3.0E-4;
  int ionpN=10, aggN=10,cubAxN=10, parN=1E5;
  int temp_ionpN, temp_aggN, temp_cubAxN, temp_parN;

  ifstream inIONP;

  inIONP.open("parametersIONP_spher_shel.txt");
  if(!inIONP){
    cout << "Unable to open IONP parameters file";
    exit(1); // terminate with error
  }

  while (inIONP>>Rionp>>Ragg>>temp_ionpN>>temp_aggN>>l_tot>>temp_cubAxN) {

  ionpN = (int)temp_ionpN;
  aggN = (int)temp_aggN;
  cubAxN = (int)temp_cubAxN;

  }

  inIONP.close();

  cout<<"Rionp: "<<Rionp<<" ionpN: "<<ionpN <<" aggN: "<<aggN<<endl;
  cout<<"l_tot: "<<l_tot<<" cubAxN: "<<cubAxN<<endl;

  ifstream indiff;

  indiff.open("parametersDiffusion_spher_shel.txt");
  if(!indiff){
    cout << "Unable to open diffusion parameters file";
    exit(1); // terminate with error
  }

  while (indiff>>temp_parN>>t_tot) {

    parN = (int)temp_parN;

  }
  indiff.close();

  cout<<"parN: "<<parN<<" t_tot: "<<t_tot<<endl;

  /////////////////////////////////////////////////////////////////////////////////////////////////
  int cubN = pow(cubAxN,3.); // total cubes
  int Nionp_tot = ionpN*aggN*cubN; //total IONP
  double f_tot = (double)Nionp_tot*(4.*PI*pow(Rionp,3.)/3.)/pow(l_tot,3.);//volume density

  //central cube
  double l_c = l_tot/(double)cubAxN;
  int Nionp_c = ionpN*aggN; //SPM in central cube
  double f_c = (double)Nionp_c*(4.*PI*pow(Rionp,3.)/3.)/pow(l_c,3.);

  cout<<"f_tot: "<<f_tot<<" Nionp_tot: "<<Nionp_tot<<" l_tot "<<l_tot<<endl;
  cout<<"f_c: "<<f_c<<" Nionp_c: "<<Nionp_c<<" l_c "<<l_c<<endl;
  cout<<"Now IONP are generated..."<<endl;

  //position of aggregate (spherical distribution IONP)
  double *x1_ionp, *x2_ionp, *x3_ionp, *theta_ionp, *phi_ionp, *r_ionp, *x1_agg, *x2_agg, *x3_agg;
  x1_ionp = new double [Nionp_tot];
  x2_ionp = new double [Nionp_tot];
  x3_ionp = new double [Nionp_tot];
  theta_ionp = new double [Nionp_tot];
  phi_ionp = new double [Nionp_tot];
  r_ionp = new double [Nionp_tot];
  x1_agg = new double [Nionp_tot];
  x2_agg = new double [Nionp_tot];
  x3_agg = new double [Nionp_tot];

  int ionpCounter = 0;
  int aggCounter = 0;
  double x1_aggTemp=0., x2_aggTemp=0., x3_aggTemp=0.;
  double ionpDist = 0.; //distance SPM-SPM

  for(int a=0; a<cubAxN; a++){ //x1-filling cubes
    for(int b=0; b<cubAxN; b++){ //x2-
      for(int c=0; c<cubAxN; c++){ //x3-

        bool far_ionp = true;

        cout<<"cube: (a, b, c): ("<<a<<", "<<b<<", "<<c<<")"<<endl;

        for(int i=0; i<aggN; i++){ //aggregate iterations
          x1_aggTemp=realDist(rng)*l_c + l_c*a - l_tot/2.; //from neg to pos filling
          x2_aggTemp=realDist(rng)*l_c + l_c*b - l_tot/2.;
          x3_aggTemp=realDist(rng)*l_c + l_c*c - l_tot/2.;

          for(int j=0; j<ionpN; j++){ //SPM iterations
          //  cout<<"SPM: "<<j<<" aggregate: "<<i<<" cube: (a, b, c): ("<<a<<", "<<b<<", "<<c<<")"<<endl;
            x1_agg[ionpCounter]=x1_aggTemp;
            x2_agg[ionpCounter]=x2_aggTemp;
            x3_agg[ionpCounter]=x3_aggTemp;

            //uniform 4pi distribution in sphere
            while(true){
              far_ionp = true; //must be updated!
              theta_ionp[ionpCounter] = 2.*PI*realDist(rng);
              phi_ionp[ionpCounter] = acos(1. - 2.*realDist(rng));
              r_ionp[ionpCounter] = (Ragg-Rionp)*sqrt(realDist(rng)); // to have uniform distribution sqrt
              x1_ionp[ionpCounter] = sin(phi_ionp[ionpCounter])*cos(theta_ionp[ionpCounter])*r_ionp[ionpCounter] + x1_agg[ionpCounter];
              x2_ionp[ionpCounter] = sin(phi_ionp[ionpCounter])*sin(theta_ionp[ionpCounter])*r_ionp[ionpCounter] + x2_agg[ionpCounter];
              x3_ionp[ionpCounter] = cos(phi_ionp[ionpCounter])*r_ionp[ionpCounter] + x3_agg[ionpCounter];

              for(int m=0; m<ionpCounter; m++){ //impenetrable IONP to each other
                ionpDist = sqrt(pow(x1_ionp[m]-x1_ionp[ionpCounter],2.)+pow(x2_ionp[m]-x2_ionp[ionpCounter],2.)+pow(x3_ionp[m]-x3_ionp[ionpCounter],2.));
          //cout<<"spmDist: "<<spmDist<<endl;
                if((j>0) && (ionpDist <= 2*Rionp)){
                  far_ionp = false;
                  cout<<"CLOSE ionp-ionp! Distanse ionp-ionp: "<<ionpDist<<endl;
                }
              }
             if(far_ionp){
               cout<<"IONP can break now! ionpCounter: "<<ionpCounter<<endl;
              break;
             }
           }

           cout<<"r_ionp: "<<r_ionp[ionpCounter]<<" x1_ionp: "<<x1_ionp[ionpCounter]<<" x2_ionp: "<<x2_ionp[ionpCounter]<<" x3_ionp: "<<x3_ionp[ionpCounter]<<endl;
           cout<<"x1_agg: "<<x1_agg[ionpCounter]<<" x2_agg: "<<x2_agg[ionpCounter]<<" x3_agg: "<<x3_agg[ionpCounter]<<endl;

           ionpCounter++;

          }
          aggCounter++;
        }
      }
    }
  }
  cout<<"ionpCounter: "<<ionpCounter<<" aggCounter: "<<aggCounter<<endl;

  //=====proton diffusion=============//

  //outfile
  //proton diffusion time-positionSPM_uniform
  FILE *outP_tPos;
  outP_tPos = fopen("V3_MAT_positionProtons_spherical.dat","wb+");
  if(!outP_tPos){// file couldn't be opened
    cerr << "Error: file could not be opened" << endl;
    exit(1);
  }

  //proton diffusion time-positionSPM_uniform
  FILE *outP_tB;
  outP_tB = fopen("V3_MAT_positionB_spherical.dat","wb+");
  if(!outP_tB){// file couldn't be opened
      cerr << "Error: file could not be opened" << endl;
      exit(1);
    }


  double *cosPhase_S, *sinPhase_S, *m_tot;
  cosPhase_S = new double [tN];
  sinPhase_S = new double [tN];
  m_tot = new double [tN];

  double tstep = 0.; // time of each step
  int stepCounter = 0; // counter for the steps for each proton
  int cnt_stpMin=0; //, cnt_stpMax=0; //counters for the step length conditions


  for (int i=0; i<parN; i++){// repetition for all the protons in the sample
    stepCounter = 0; //reset

    cout<<"Now diffusion calculated for proton: "<<i<<endl;

    double x0[3]={0.}, xt[3]={0.}, vt[3]={0.};
    double tt=0.;
    double stepL_min = Rionp/8.; //min step length
    double stepL_max = Rionp; //max step length
    double stepL = 0.;
    double extraL = 0.; //extra length beyond central cube
    bool hit_ionp = false;
    double pIONPDist = 0.; // proton-IONP Distance (vector ||)
    double pIONPCosTheta = 0.; //proton_IONP vector cosTheta with Z axis
    double Bloc = 0.; //B 1 IONP
    double Btot = 0.; //SUM B all IONP
    double Dphase = 0.; //Delta phase for step 1p
    double phase = 0.; //phase 1p
    double theta_p=0., phi_p=0.;

    //randomized initial position of the particle;
    x0[0] = realDist(rng)*l_c - l_c/2.;
    x0[1] = realDist(rng)*l_c - l_c/2.;
    x0[2] = realDist(rng)*l_c - l_c/2.;

    //for (int j=0; j<tN; j++){ //steps
    bool diffTime = true; // flag protons are allowed to diffuse (tt<10ms)
    while(diffTime){ // steps loop

     //unit vector for 4p direction
     theta_p = 2.*PI*realDist(rng);
     phi_p = acos(1. - 2.*realDist(rng));

     vt[0] = sin(phi_p)*cos(theta_p);
     vt[1] = sin(phi_p)*sin(theta_p);
     vt[2] = cos(phi_p);;

     //determine length of step
     for(int k=0; k<ionpCounter; k++){
      if(abs(sqrt(pow(x1_ionp[k]-x0[0],2.)+pow(x2_ionp[k]-x0[1],2.)+pow(x3_ionp[k]-x0[2],2.))-Rionp) <= 8*Rionp){
        //spm closer than 8R
        stepL = Rionp/8;
        cnt_stpMin ++;
        break;
      }
      else if(abs(sqrt(pow(x1_ionp[k]-x0[0],2.)+pow(x2_ionp[k]-x0[1],2.)+pow(x3_ionp[k]-x0[2],2.))-Rionp) > 8*Rionp){
        stepL = Rionp;
      }
      else{
        cout<<"sth wrong with the proton-IONP distance!"<<endl;
      }
    }

    //determine Dt step duration
    tstep = pow(stepL,2.)/(6.*D_const);
    tt += tstep;
    if(tt>t_tot){
      diffTime = false; //proton is not allowed to diffuse any longer
      cout<<"Proton id: "<<i<<" has reached diffusion time! -> Move to next one!"<<endl;
      cout<<"stepCounter: "<<stepCounter<<" cnt_stpMin: "<<cnt_stpMin<<endl;
    }

    while(true){
      xt[0]=x0[0]+vt[0]*stepL;
      xt[1]=x0[1]+vt[1]*stepL;
      xt[2]=x0[2]+vt[2]*stepL;
      for(int m=0; m<3; m++){
        if(abs(xt[m]) > l_c/2.){ //particle outside central cube,// reflected, elastic collision(no!)
          //particle enters fron the other way, periodicity
        //  hit_cx[m] = true; //I don't need it yet
          extraL = abs(xt[m]) - l_c/2.;
        //  xt[m]=-x0[m];
          cout<<"proton outside! xt[m]: "<<xt[m]<<" extra lenght: "<<extraL<<endl;
          xt[m] = xt[m]-l_c;
          cout<<"Relocating => new x[t]: "<<xt[m]<<endl;
        }
      }
      for(int k=0; k<ionpCounter; k++){//check if proton inside SPM
        pIONPDist = sqrt(pow((x1_ionp[k]-xt[0]),2.)+pow((x2_ionp[k]-xt[1]),2.)+pow((x3_ionp[k]-xt[2]),2.)) - Rionp;
        if(pIONPDist <= 0.){
          cout<<"proton inside IONP => reposition! Distance: "<<pIONPDist<<" Rionp: "<<Rionp<<endl;
          hit_ionp = true;
        }
        else if(pIONPDist > 0.){
          hit_ionp=false; //with this I don't have to reset flag in the end
          //calculations of Bloc for this position
          pIONPCosTheta = (x3_ionp[k]-xt[2])/pIONPDist;
          Bloc = pow(Rionp,3.)*Beq*(3.*pIONPCosTheta - 1.)/pow(pIONPDist,3.);
          Btot += Bloc;
          //cout<<"pSPMDist: "<<pSPMDist<<" pSPMCosTheta: "<<pSPMCosTheta<<" Bloc: "<<Bloc<<" Btot: "<<Btot<<endl;
        }
        else{
          cout<<"Something wrong with the calculation of pIONPDist! "<<pIONPDist<<endl;
          hit_ionp = true;
        }
      }

      if(!hit_ionp){
      //  hit_spm=false; //reset flag (unnessesary alreaty false)
        break;
      }
    }// end of while for new position -> the new position is determined, Btot calculated

    // Dphase, phase
    Dphase = gI*Btot*tstep;
    phase += Dphase;

    //store phase for this step
    //filled for each proton at this timepoint (step)

    cosPhase_S[stepCounter] += cos(phase);
    sinPhase_S[stepCounter] += sin(phase);

    //reset Btot
    Btot = 0.;
    stepCounter++;

   } //end of for loop step
  } //end of for loop particles

  //-----calculate the <m> the total magnetization
for(int t=0; t<tN; t++){
  m_tot[t] = sqrt(pow(cosPhase_S[t],2.) + pow(sinPhase_S[t],2.))/(double)parN;
  //cout<<"m_tot[t]: "<<m_tot[t]<<endl;
}
fclose(outP_tPos); //proton time-position
fclose(outP_tB); //proton time-B

//====== outfile data=============//

  //----- output data of SPM position---------//
  FILE *outP_S;
  outP_S = fopen("V3_MAT_positionSPM_spherical.dat","wb+");
  if(!outP_S){// file couldn't be opened
    cerr << "Error: file could not be opened" << endl;
    exit(1);
  }
  for (int i=0; i<ionpCounter; ++i){
    fprintf(outP_S,"%.10f \t %.10f \t %.10f\n",x1_ionp[i],x2_ionp[i],x3_ionp[i]);
  }
  fclose(outP_S);

  FILE *outP_agg;
  outP_agg = fopen("V3_MAT_positionAggreg_spherical.dat","wb+");
  if(!outP_agg){// file couldn't be opened
    cerr << "Error: file could not be opened" << endl;
    exit(1);
  }
  for (int j=0; j<ionpCounter; ++j){
    fprintf(outP_agg,"%.10f \t %.10f \t %.10f\n",x1_agg[j],x2_agg[j],x3_agg[j]);
  }
  fclose(outP_agg);

  FILE *outSngl;
  outSngl = fopen("V3_MAT_positionSingle_spherical.dat","wb+");
  if(!outSngl){// file couldn't be opened
    cerr << "Error: file could not be opened" << endl;
    exit(1);
  }
  int findAgg = (int)(realDist(rng)*aggN);
  int idxMin = findAgg*ionpN;
  int idxMax = idxMin + ionpN;
  for (int k=idxMin; k<idxMax; ++k){
    fprintf(outSngl,"%.10f\t%.10f\t%.10f\t%.10f\t%.10f\t%.10f\n",x1_agg[k],x2_agg[k],x3_agg[k],x1_ionp[k],x2_ionp[k],x3_ionp[k]);
  }
  fclose(outSngl);

  //delete new arrays
  delete[] x1_ionp;
  delete[] x2_ionp;
  delete[] x3_ionp;
  delete[] theta_ionp;
  delete[] phi_ionp;
  delete[] r_ionp;
  delete[] x1_agg;
  delete[] x2_agg;
  delete[] x3_agg;

  delete[] cosPhase_S;
  delete[] sinPhase_S;
  delete[] m_tot;



}

Micro-optimization might reduce the running time by 100 times, but it sounds you have to improve it by far more than that. Which lead to my second comment, I counted 7 nested loops in one of your function. Are you sure that your algorithm is the way to go? Do you have an idea of its complexity? If it is big O(n^3) there is no optimization that will bring you from 100 to a 1000000 of particles (GPUs or multithreading) — Alessandro Teruzzi, Jun 09 '21 at 14:42
I agree that the issue is your overall algorithm complexity. One micro-optimization I can think of is using squared Distances (https://en.wikipedia.org/wiki/Euclidean_distance#Squared_Euclidean_distance) for `ionpDist` and thus remove the sqrt call — m88, Jun 09 '21 at 14:57
Thank you for the fast response. I have tested it for 10000 particles and 1000 protons and it takes about 5 days. The realistic and expected numbers to compare them with the experiment, I have calculated them to be 10^(10) particles and 5000-10000 protons. Please, do you have any advice or suggestion? I am thinking of defining the particles and clusters as classes and having them as separate headers, because now I only have one main and making functions instead of for loops. Also I don't know about the arrays. — FoxxyL707, Jun 09 '21 at 15:03
First, make sure you're compiling with `-O3` on GCC or `/O2` on MSVC. Then try using [Eigen](http://eigen.tuxfamily.org/dox-devel/GettingStarted.html) matrices for your data and arithmetic operations. It will help readability, but also performance since it enables vectorisation without getting your hands dirty with SIMD intrinsics and the like. — m88, Jun 09 '21 at 15:37
@FoxxyL707 separating in classes is a good thing, but necessarily is going to reduce execution time. Let's do some back of the envelop calculation: let's say that your algorithm is linear with number of particles, so 11000 takes 5 days 10^10 will take roughly 12.500 years. And I don't think your algorithm complexity is linear. So, let's say you make your code going 100 times faster, you are going to 125 years. Now, if you can multithread the code and run to gpus is doable (not easy). But if you algorithm is n^2 there is no way to get this thing done — Alessandro Teruzzi, Jun 09 '21 at 15:43
Have you considered some form of spatial subdivision (like in [Barnes-Hut](https://en.wikipedia.org/wiki/Barnes%E2%80%93Hut_simulation), even if it's for a completely different problem) technique? — Bob__, Jun 09 '21 at 16:02
@m88 thank you for the squared distances suggestion. I am compiling using g++ on a linux machine. I will try the Eigen matrices, thank you. Any other suggestions? If you don't mind. — FoxxyL707, Jun 09 '21 at 16:12
@Alesstingandro Terruzzi , so what should I do about the classes? Will they slow down the execution time too much? Sorry but I am not that experienced. I want to learn though!! Also I am thinking of multithread. I will implement it. Is there anything else that I could do, or should I change the simulation approach from the start? — FoxxyL707, Jun 09 '21 at 16:13
@Bob__ no I didn't know about it, but thank you. It seems an interesting technique that I it is worth looking at. Thanks. — FoxxyL707, Jun 09 '21 at 16:21
It is very difficult to help you without the full code, we can give you some hints but not much more. One thing, the loop over the particles inside the proton diffussion breaks when one specific particle is closer than a certain amount. This is good, it means that if you are smart in the search, the number of electrons will not impact your running time that much. — Alessandro Teruzzi, Jun 09 '21 at 17:04
@AlessandroTeruzzi I can edit my question and add the full code, if that would help. About the break in the loop I am trying to find some approximations that could speed up the calculations — FoxxyL707, Jun 09 '21 at 17:30
Clean code is easier to optimize. It would go a long way to use functions to break up the main function, use `std::vector` instead of manual memory management, and so on. — GManNickG, Jun 09 '21 at 17:48
I don't know if that helps, but the idea of the simulations is to generate the magnetic particles distributed in clusters in the simulation volume (stationary). Then protons are generated in the central cube and diffuse. At each step I have to calculate the total magnetic field they feel from the particles and the dephasing they experience. Then I have to add for every timepoint the dephasing of all the protons from the central cube to calculate the signal attenuation. It is about MRI and how to simulate the contrast in an MRI image. — FoxxyL707, Jun 09 '21 at 17:55
@FoxxyL707 there are some files referred in the code, could you attached the content as well? — Alessandro Teruzzi, Jun 09 '21 at 17:59
@GManNickG Thank you for the advice. I will implement functions. So instead of arrays I should use vectors? And what about classes? Should I create the particles as classes and have one method to create them in the simulation volume? — FoxxyL707, Jun 09 '21 at 17:59
I've noted that you have declared e.g. *three* different arrays, one for each coordinate of a particle, but you'll have a *lot* of particles, so that each x[i], y[i] and z[i] are going to be "distant" values in memory. A proper class collecting those coordinates for each particle is likely to be more cache friendly. — Bob__, Jun 09 '21 at 18:01
Consider moving this question to https://codereview.stackexchange.com/ where a more in-depth review can be given. — G. Sliepen, Jun 09 '21 at 18:07
@AlessandroTeruzzi the files referencenced are simple .txt with tnumbers that are the parameters of the simulation for instance the radius of the particle and the number of clusters. I can upload them. — FoxxyL707, Jun 09 '21 at 18:10
@FoxxyL707 I would like to run it (obviously I will a smaller sample set) and profile it — Alessandro Teruzzi, Jun 09 '21 at 18:16
@FoxxyL707 Yes, that would help. Once you have created the post there, delete this one here, as we like to avoid duplicate posts :) — G. Sliepen, Jun 09 '21 at 18:24
how many particles is running in the example added? I can see clearly 1000 protons but how many magnetics particles? The 10e5 seems to be override by the input in the file. I am a bit confused about it — Alessandro Teruzzi, Jun 09 '21 at 21:00
I have a smaller amount of particles in this input file just for plotting the clusters as a starting point to make sure that they are generated correctly. But I will need eventually to have a big number — FoxxyL707, Jun 10 '21 at 04:59

Alessandro Teruzzi · Accepted Answer · 2021-06-10T13:17:57.890

I talked the problem in more steps, first thing I made the run reproducible:

  mt19937 rng(127386261); //I want a deterministic seed

Then I create a script to compare the three output files generated by the program:

#!/bin/bash

diff V3_MAT_positionAggreg_spherical.dat V3_MAT_positionAggreg_spherical2.dat
diff V3_MAT_positionSingle_spherical.dat V3_MAT_positionSingle_spherical2.dat
diff V3_MAT_positionSPM_spherical.dat V3_MAT_positionSPM_spherical2.dat

Where the files ending in two is created by the optimized code and the other by your version.

I run your version compiling with O3 flag and marked the time (for 20 magnetic particles and 10 protons it is taking 79 seconds on my box, my architecture is not that important because we are just going to compare the differences).

Then I start refactoring steps by steps, running every small changes comparing the output files and the time, here are all the iterations:

Remove redundant else if gain 5 seconds (total run 74.0 s)

if(sqrt(pow(x1_ionp[k]-x0[0],2.)+pow(x2_ionp[k]-x0[1],2.)+pow(x3_ionp[k]-x0[2],2.)) <= 7*Rionp){
  //spm closer than 8R
  stepL = Rionp/8;
  cnt_stpMin ++;
  break;
}
else { //this was an else if and an else for error that will never happen
  stepL = Rionp;
}

At this point, I run it under the profiler and pow function stood out.

Replacing pow with square and cube gain 61 seconds (total run 13.2 s)

Simply replacing pow(x,2.) with square(x) and pow(x,3.) with cube(x) will reduce the run time by about 600%

double square(double d)
{
  return d*d;
}
double cube(double d)
{
  return d*d*d;
}

Now the gain is reduced quite a lot for each improvement, but still.

Remove redundant sqrt gain (total run 12.9 s)

 double ionpDist = square(x1_ionp[m]-x1_ionp[ionpCounter])+square(x2_ionp[m]-x2_ionp[ionpCounter])+square(x3_ionp[m]-x3_ionp[ionpCounter]);
 //cout<<"spmDist: "<<spmDist<<endl;
 if((j>0) && (ionpDist <= 4*square_Rionp)){

Introducing const variable square_Rionp and cube_Rionp (total run 12.7 s)

const double square_Rionp = square(Rionp);
const double cube_Rionp = cube(Rionp);
//replaced in the code like this
if((j>0) && (ionpDist <= 4*square_Rionp)){

Introducing variable for pi (total run 12.6 s)

const double Two_PI = PI*2.0;
const double FourThird_PI = PI*4.0/3.0;

Remove a (another) redundant else if (total run 11.9s)

if(pIONPDist <= 0.){
  cout<<"proton inside IONP => reposition! Distance: "<<pIONPDist<<" Rionp: "<<Rionp<<endl;
  hit_ionp = true;
}
else { //this was an else if without any reason
  hit_ionp=false; //with this I don't have to reset flag in the end
  //calculations of Bloc for this position
  pIONPCosTheta = (x3_ionp[k]-xt[2])/pIONPDist;
...
}

Remove another redundant sqrare root (total run 11.2 s)

const double Seven_Rionp_squared =square(7*Rionp);
...
for(int k=0; k<ionpCounter; k++){
  if(square(x1_ionp[k]-x0[0])+square(x2_ionp[k]-x0[1])+square(x3_ionp[k]-x0[2]) <= Seven_Rionp_squared){
  //spm closer than 8R
  stepL = stepL_min;
  cnt_stpMin ++;
  break;
}

I don't see many more things obvious to squeeze more performance out of it. Further optimization may require some thinking.

I did another comparison run with 50 magnetic particles and 10 protons and I have found that my version is 7 times faster then the yours and it is producing the exact same files.

I would do this exercise with the help of source control.

Your code is trivially parallelizable, but I will go that route just when you have optimized the single thread version.

EDIT

Change += with = operator (total run 6.23 s)

I have noticed that += operator is used for no reason, the substitution to operator = is a substanzial gain:

cosPhase_S[stepCounter] = cos(phase);
sinPhase_S[stepCounter] = sin(phase);

Thank you so much!! I appreciate your help so much! There is such a change with just replacing the variables! Thank you! I will definitely implement these recommentation. Also I am thinking of further improving the logic of the simulation, perhaps the generation of the particles and the repetitions of the cubes or all these controls with the protons that I do so that they are not so much time and memory consuming. I am trying to learn and implement as many things as I can, because C++ is really powerful and I want to improve myself. :) — FoxxyL707, Jun 10 '21 at 11:57
Thank you, and all the people that have mentioned suggestions. I will try to improve my code. One last question. Should I implement the particles into classes and have methods to create them or test the distances? Would it improve the simulations? Thanks. — FoxxyL707, Jun 10 '21 at 16:05
@FoxxyL707 it might, surely it will increase the readability, the key of optimization is experimentation, do a small change and run a test, make sure the result is the same and repeat the cycle. You run shouldn't be too long (I would say some around a minute). Another small thing, if you are using gcc or clang, make sure to profile O3 and O2 as well. Sometimes O2 gives you better speed performance than O3. — Alessandro Teruzzi, Jun 10 '21 at 16:59

Optimize c++ Monte Carlo simulation with long dynamic arrays

1 Answers1