
The application I want to speed up performs element-wise processing of a large array (about 1e8 elements).

The processing of each element is very simple, and I suspect the bottleneck could be DRAM bandwidth rather than the CPU, so I decided to study the single-threaded version first.

The system is: Windows 10 64-bit, 32 GB RAM, Intel Core i7-3770S (Ivy Bridge), 1.10 GHz, 4 cores, Hyper-Threading enabled.
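
As a rough reference for the DRAM-bandwidth suspicion above, one can time a crude single-threaded streaming loop over arrays much larger than the L3 cache. This is only a sketch, not part of the test program; the array size and the 2x traffic estimate are assumptions:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    const std::size_t n = 100000000;             // ~1e8 floats per array (~400 MB each), far larger than the 8 MB L3
    std::vector<float> src(n, 1.0f), dst(n, 0.0f);

    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * 2.0f;                  // one explicit load + one explicit store per element
    const auto t1 = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    const double gbMoved = 2.0 * n * sizeof(float) / 1e9;   // bytes explicitly touched; real DRAM traffic can be higher (write-allocate)
    std::cout << gbMoved / seconds << " GB/s (single thread), checksum "
              << dst[n / 2] << std::endl;        // use dst so the loop cannot be optimized away
    return 0;
}

The reported number gives a rough upper bound on what one thread can pull from DRAM on this machine, which is what the real 1e8-element workload would be compared against.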

Concurrency Analysis

Elapsed Time:   34.425s
    CPU Time:   14.908s
        Effective Time: 14.908s
            Idle:   0.005s
            Poor:   14.902s
            Ok: 0s
            Ideal:  0s
            Over:   0s
        Spin Time:  0s
        Overhead Time:  0s
    Wait Time:  0.000s
        Idle:   0.000s
        Poor:   0s
        Ok: 0s
        Ideal:  0s
        Over:   0s
    Total Thread Count: 2
    Paused Time:    18.767s

Memory Access Analysis

Memory Access Analysis reports different CPU Time values for three consecutive runs on the same amount of data. The actual execution time was about 23 seconds, as the Concurrency Analysis shows.

Elapsed Time:   33.526s
    CPU Time:   5.740s
    Memory Bound:   38.3%
        L1 Bound:   10.4%
        L2 Bound:   0.0%
        L3 Bound:   0.1%
        DRAM Bound: 0.8%
            Memory Bandwidth:   36.1%
            Memory Latency: 60.4%
    Loads:  12,912,960,000
    Stores: 7,720,800,000
    LLC Miss Count: 420,000
    Average Latency (cycles):   15
    Total Thread Count: 4
    Paused Time:    18.081s

Elapsed Time:   33.011s
    CPU Time:   4.501s
    Memory Bound:   36.9%
        L1 Bound:   10.6%
        L2 Bound:   0.0%
        L3 Bound:   0.2%
        DRAM Bound: 0.6%
            Memory Bandwidth:   36.5%
            Memory Latency: 62.7%
    Loads:  9,836,100,000
    Stores: 5,876,400,000
    LLC Miss Count: 180,000
    Average Latency (cycles):   15
    Total Thread Count: 4
    Paused Time:    17.913s

Elapsed Time:   33.738s
    CPU Time:   5.999s
    Memory Bound:   38.5%
        L1 Bound:   10.8%
        L2 Bound:   0.0%
        L3 Bound:   0.1%
        DRAM Bound: 0.9%
            Memory Bandwidth:   57.8%
            Memory Latency: 37.3%
    Loads:  13,592,760,000
    Stores: 8,125,200,000
    LLC Miss Count: 660,000
    Average Latency (cycles):   15
    Total Thread Count: 4
    Paused Time:    18.228s

As far as I understand the Summary Page, the situation is not very good.

The paper "Finding your Memory Access performance bottlenecks" says that the reason is so-called false sharing. But I do not use multithreading; all processing is performed by just one thread.

On the other hand, according to the Memory Access Analysis Platform page, DRAM bandwidth is not the bottleneck.
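
(Back-of-envelope, assuming the default of 150,000 points: each Matrix3X holds 3 × 150,000 floats, i.e. about 1.8 MB, so points and projections together are only about 3.6 MB; even with the tasklets on top, the per-run working set is on the order of the 8 MB L3 of the i7-3770S, which is consistent with the low LLC Miss Count values above.)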

So the questions are:

  1. Why are the CPU Time values different between Concurrency Analysis and Memory Access Analysis?
  2. What is the reason for the poor memory metric values, especially L1 Bound?

The main loop is in runTest below, where:

  • tasklets: a std::vector of simple structures that contain coefficients for processing the data
  • points: the data itself, an Eigen::Matrix
  • projections: an Eigen::Matrix that the results of the processing are written into

The code is:

#include <iostream>
#include <future>
#include <random>
#include <sstream>
#include <vector>

#include <Eigen/Dense>

#include <ittnotify.h>

using namespace std;

using Vector3 = Eigen::Matrix<float, 3, 1>;
using Matrix3X = Eigen::Matrix<float, 3, Eigen::Dynamic>;

uniform_real_distribution<float> rnd(0.1f, 100.f);
default_random_engine gen;

class Tasklet {
public:
    Tasklet(int p1, int p2)
        :
        p1Id(p1), p2Id(p2), Loc0(p1)
    {
        RestDistance = rnd(gen);
        Weight_2 = rnd(gen);
    }
    __forceinline void solve(const Matrix3X& q, Matrix3X& p)
    {
        Vector3 q1 = q.col(p1Id);
        Vector3 q2 = q.col(p2Id);
        // Note: this loop never executes (its condition is i < 0), so it is dead code.
        for (int i = 0; i < 0; ++i) {
            Vector3 delta = q2 - q1;
            float norm = delta.blueNorm() * delta.hypotNorm();
        }
        Vector3 deltaQ = q2 - q1;
        float dist = deltaQ.norm();
        Vector3 deltaUnitVector = deltaQ / dist;
        p.col(Loc0) = deltaUnitVector * RestDistance * Weight_2;
    }

    int p1Id;
    int p2Id;
    int Loc0;
    float RestDistance;
    float Weight_2;
};

typedef vector<Tasklet*> TaskList;

void
runTest(const Matrix3X& points, Matrix3X& projections, TaskList& tasklets)
{
    size_t num = tasklets.size();
    for (size_t i = 0; i < num; ++i) {
        Tasklet* t = tasklets[i];
        t->solve(points, projections);
    }
}

void
prepareData(Matrix3X& points, Matrix3X& projections, int numPoints, TaskList& tasklets)
{
    points.resize(3, numPoints);
    projections.resize(3, numPoints);
    points.setRandom();
    /*
    for (int i = 0; i < numPoints; ++i) {
    points.col(i) = Vector3(1, 0, 0);
    }
    */
    tasklets.reserve(numPoints - 1);
    for (int i = 1; i < numPoints; ++i) {
        tasklets.push_back(new Tasklet(i - 1, i));
    }

}

int
main(int argc, const char** argv)
{
    // Pause VTune data collection
    __itt_pause();
    cout << "Usage: <exefile> <number of points (in thousands)> <#runs for averaging>" << endl;

    int numPoints = 150 * 1000;
    int numRuns = 1;
    int argNo = 1;

    if (argc > argNo) {
        istringstream in(argv[argNo]);
        int i;
        in >> i;
        if (in) {
            numPoints = i * 1000;
        }
    }
    ++argNo;
    if (argc > argNo) {
        istringstream in(argv[argNo]);
        int i;
        in >> i;
        if (in) {
            numRuns = i;
        }
    }
    cout
        << "Running test" << endl
        << "\t NumPoints (thousands): " << numPoints / 1000. << endl
        << "\t # of runs for averaging: " << numRuns << endl;

    Matrix3X q, projections;
    TaskList tasklets;

    cout << "Preparing test data" << endl;

    prepareData(q, projections, numPoints, tasklets);

    cout << "Running test" << endl;

    // Resume VTune data collection
    __itt_resume();
    for (int r = 0; r < numRuns; ++r) {
        runTest(q, projections, tasklets);
    }
    // Pause VTune data collection
    __itt_pause();

    for (auto* t : tasklets) {
        delete t;
    }

    return 0;
}

Thank you.

  • Added code to the question – user2351152 Nov 05 '16 at 19:59
  • Is there a reason you're showing so little of the code? I could be a lot more helpful if I didn't have to guess almost everything about it. Are your matrices row-major or column-major? Are your tasklets contiguous? What even is a tasklet composed of? How much time is spent on each line of this code? How big are your matrices? Are `t->p1Id` and `t->p2Id` randomly or sequentially accessing columns? – Veedrac Nov 05 '16 at 20:29
  • From what I see I'm not surprised you're spending a large fraction of your time waiting on DRAM and L1 cache. You're doing random access on your critical path. What else would you expect? – Veedrac Nov 05 '16 at 20:32
  • FWIW, use `deltaUnitVector = deltaQ.normalized()`. – Veedrac Nov 05 '16 at 20:42
  • Thank you for the comments. Access is sequential indeed. I simplified the main loop to show the actual data access. The L1 Bound stays the same (about 10%). And I do not understand why the CPU Time metric given by Memory Access Analysis differs from the CPU Time metric given by Concurrency Analysis. – user2351152 Nov 06 '16 at 19:10
  • You're doing twice as many loads as stores, which hints that `q2` isn't getting reused from the previous iteration. What happens if you do so manually by writing `q1 = q2` inside the loop instead? Also, what happens if you make the input pointers `__restrict__`? – Veedrac Nov 06 '16 at 19:43
  • I don't know why the CPU times are different. One can only imagine that the larger includes more things than the other; eg. includes stall times that don't involve context switches. – Veedrac Nov 06 '16 at 19:45
  • @Veedrac __restrict__ does not change the picture. On q1/q2 reuse: it seems the CPU cache is intended to handle such cases, isn't it? – user2351152 Nov 07 '16 at 08:46
  • You're spending a lot of time waiting on L1, which *is* the cache. So the cache is working - you're not waiting on L2 or lower - but that doesn't stop the fact that waiting on L1 is still slower than *not* doing so. My suggestions are to investigate that, since I can't run the code locally. – Veedrac Nov 07 '16 at 13:32
  • I updated the question so it includes full code. I have removed service code only. – user2351152 Nov 07 '16 at 18:17
  • What compiler and flags are you using? I find with `g++` using `-march=native` gives a ~2.5x IPC improvement in this loop, and `-ffast-math` improves that to ~3. You can get further improvements by making `TaskList` a `vector<Tasklet>` instead of a `vector<Tasklet*>`. My compiler was obviously generating very poor code before. – Veedrac Nov 07 '16 at 19:18
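
To make the suggestions from the comments concrete, below is a minimal sketch (not the original code) that stores the tasklets by value, uses `normalized()`, and carries the previous column over so each point is loaded from `points` only once. It reuses the `Tasklet`, `Vector3` and `Matrix3X` types from the question and assumes the sequential `(i - 1, i)` tasklet layout created in `prepareData()`:

// Sketch of the suggestions from the comments above (assumes the question's types).
typedef vector<Tasklet> TaskListByValue;   // values instead of pointers: contiguous, no per-element indirection

void
runTestByValue(const Matrix3X& points, Matrix3X& projections, TaskListByValue& tasklets)
{
    if (tasklets.empty())
        return;

    Vector3 q1 = points.col(tasklets.front().p1Id);
    for (size_t i = 0; i < tasklets.size(); ++i) {
        const Tasklet& t = tasklets[i];
        Vector3 q2 = points.col(t.p2Id);
        // normalized() is equivalent to deltaQ / deltaQ.norm()
        projections.col(t.Loc0) = (q2 - q1).normalized() * t.RestDistance * t.Weight_2;
        q1 = q2;   // the next tasklet's p1Id equals this tasklet's p2Id, so reuse the loaded column
    }
}

The corresponding change in prepareData() would be tasklets.emplace_back(i - 1, i) instead of push_back(new Tasklet(i - 1, i)), and the delete loop in main() goes away. Whether this helps in practice also depends on the optimization flags (for example -O3, -march=native and -ffast-math with g++, or their MSVC equivalents), as the last comment notes.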

0 Answers