xtensor's "operator/" slower than numpy's "/"

Question

I'm trying to transfer some code I've previously written in python into C++, and I'm currently testing xtensor to see if it can be faster than numpy for doing what I need it to.

One of my functions takes a square matrix d and a scalar alpha, and performs the elementwise operation alpha/(alpha+d). Background: this function is used to test which value of alpha is 'best', so it is in a loop where d is always the same, but alpha varies.

All of the following time scales are an average of 100 instances of running the function.

In numpy, it takes around 0.27 seconds to do this, and the code is as follows:

def kfun(d,alpha):
    k = alpha /(d+alpha)
    return k

but xtensor takes about 0.36 seconds, and the code looks like this:

xt::xtensor<double,2> xk(xt::xtensor<double,2> d, double alpha){
    return alpha/(alpha+d);
}

I've also attempted the following version using std::vector but this something I do not want to use in long run, even though it only took 0.22 seconds.

std::vector<std::vector<double>> kloops(std::vector<std::vector<double>> d, double alpha, int d_size){
    for (int i = 0; i<d_size; i++){
        for (int j = 0; j<d_size; j++){
            d[i][j] = alpha/(alpha + d[i][j]);
        }
    }
    return d;
}

I've noticed that the operator/ in xtensor uses "lazy broadcasting", is there maybe a way to make it immediate?

EDIT:

In Python, the function is called as follows, and timed using the "time" package

t0 = time.time()
for i in range(100):
    kk = k(dsquared,alpha_squared)
print(time.time()-t0)

In C++ I call the function has follows, and is timed using chronos:

//d is saved as a 1D npy file, an artefact from old code
auto sd2 = xt::load_npy<double>("/path/to/d.npy");

shape = {7084, 7084};
    xt::xtensor<double, 2> xd2(shape);
    for (int i = 0; i<7084;i++){
        for (int j=0; j<7084;j++){
            xd2(i,j) = (sd2(i*7084+j));
        }
    }

auto start = std::chrono::steady_clock::now();
for (int i = 0;i<10;i++){
    matrix<double> kk = kfun(xd2,4000*4000,7084);
}
auto end = std::chrono::steady_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
std::cout << "k takes: " << elapsed_seconds.count() << "\n";

If you wish to run this code, I'd suggest using xd2 as a symmetric 7084x7084 random matrix with zeros on the diagonal.

The output of the function, a matrix called k, then goes on to be used in other functions, but I still need d to be unchanged as it will be reused later.

END EDIT

To run my C++ code I use the following line in the terminal:

cd "/path/to/src/" && g++ -mavx2 -ffast-math -DXTENSOR_USE_XSIMD -O3 ccode.cpp -o ccode -I/path/to/xtensorinclude && "/path/to/src/"ccode

Thanks in advance!

Good question! A general remark to improve the question is that a minimal reproducible example would be better. Specifically you could include the few lines you use to call the function. That would make it easier to judge the subtleties regarding the copies. A more detailed point is that your `std::vector` example does not seem to allocate the return. Furthermore, in all generality you should use `for (int i = 0; i — Tom de Geus, Mar 20 '21 at 10:30
@TomdeGeus Hi! Thank you for your comment. I just wanted to clarify, obviously I'm quite new to this, but I guess I assumed that it would be faster for the function if I just specified the size rather than asking it to work it out each time? Is this wrong? This function is called in a loop, specifically with varying values of alpha. Also, what do you mean my std::vector example doesn't allocate the return? I know you can do a void function for one that makes a change to an input, for example, have I accidentally done that rather than outputting an altered "d"? — abic011, Mar 22 '21 at 09:32
The compiler will probably optimise out (some of the) size calls, but honestly your `d_size` option fell out of my screen, so I did not notice is and assumed you may have had a typo. For the vector example you had some `d2` that was not defined, but you corrected this no, so all is good! — Tom de Geus, Mar 22 '21 at 10:28
A minor comment concerning the question. With the latest edit the question has gotten much better. What would be even better is to ensure that it is reproducible: that anyone can copy you code snippet and compile and run it directly. To that end you could introduce `dsquared` and `xd2` simply as a matrix of random numbers — Tom de Geus, Mar 22 '21 at 10:37
@TomdeGeus would you mind, is what I've written at the bottom of the edit ok? Or should I upload an example? — abic011, Mar 22 '21 at 10:56
The edit indeed clarifies. However, there is still some room for improvement. There is some reference [here](https://stackoverflow.com/help/minimal-reproducible-example). For your example: your code is not stand-alone. For example: you refer to `k` in the Python code but the function in `kfun` above. Similarly, the function is `xk` for xtensor, but called with `kfun`. Finally you use `matrix` with is not specified, reproducible would have been using `auto` or `xt::xtensor`, or specifying `matrix`. — Tom de Geus, Mar 23 '21 at 08:58
@TomdeGeus You're right, sorry about that. I'd tried to be consistent in my function names, but changing them from the ones I actually use so it would be more readable to someone else, however I think I must have forgotten to change this with the edits. When asking questions in the future I'll try to remember this! Thank you for your feedback, it's been really helpful — abic011, Mar 23 '21 at 09:51
You are welcome. For future reference I do think it is still good to edit here to be consistent. — Tom de Geus, Mar 23 '21 at 14:34

emmenlau · Accepted Answer · 2021-04-01T07:56:30.267

3

A problem with the C++ implementation may be that it creates one or possibly even two temporary copies that could be avoided. The first copy comes from not passing the argument by reference (or perfect forwarding). Without looking at the rest of the code its hard to judge if this has an impact on the performance or not. The compiler may move d into the method if its guaranteed to be not used after the method xk(), but it is more likely to copy the data into d.

To pass by reference, the method could be changed to

xt::xtensor<double,2> xk(const xt::xtensor<double,2>& d, double alpha){
    return alpha/(alpha+d);
}

To use perfect forwarding (and also enable other xtensor containers like xt::xarray or xt::xtensor_fixed), the method could be changed to

template<typename T>
xt::xtensor<double,2> xk(T&& d, double alpha){
    return alpha/(alpha+d);
}

Furthermore, its possible that you can save yourself from reserving memory for the return value. Again, its hard to judge without seeing the rest of the code. But if the method is used inside a loop, and the return value always has the same shape, then it can be beneficial to create the return value outside of the loop and return by reference. To do this, the method could be changed to:

template<typename T, typename U>
void xk(T& r, U&& d, double alpha){
    r = alpha/(alpha+d);
}

If it is guaranteed that d and r do not point to the same memory, you can further wrap r in xt::noalias() to avoid a temporary copy before assigning the result. The same is true for the return value of the function in case you do not return by reference.

Good luck and happy coding!

edited Apr 01 '21 at 07:56

answered Mar 19 '21 at 18:21

emmenlau

958
12
20

Hi! Thank you for your answer! I just wanted to clarify because I don't think I specified in my question, I need to keep the original d matrix, and also output a matrix called "k", this is because I essentially need to try lots of different values of alpha to see which gives the "best" result. I believe what you're suggesting wouldn't allow that, is that correct? Apologies, this is my first stackoverflow question, I think I didn't give enough information. – abic011 Mar 22 '21 at 09:39
1

What for me is important in this question on the C++ side is that @abic011 finds that the vector example is quite faster, while that in principle contains the same copies that you refer to in this answer. So, eliminating the copies may explain the difference with NumPy that indeed would not copy either (it would be nice if @abic011 can confirm, but not with the `std::vector` version. With the same number of copies both should be at worst equally fast/slow – Tom de Geus Mar 22 '21 at 10:33
Dear @abic001, in response to the question about keeping matrix `d` unchanged, this is no problem. You can still pass `d` as const reference (my first suggestion) and the method call will be a lot faster. Matrix `d` will not be changed by the function. About Matrix `k`, maybe edit your question to show its usage? – emmenlau Mar 22 '21 at 10:45
@TomdeGeus I've assumed that the reason std::vector is faster than the other versions is because the standard library has the fastest method of accessing elements, although I could be wrong? Sorry, I'm not sure exactly what you'd like me to confirm, can you explain a little more? Unfortunately @emmenlau passing `d` as `const` is consistently taking around 0.38 seconds, currently on my laptop the numpy is taking about 0.28. – abic011 Mar 22 '21 at 11:18
1

Apologies! When copying over I miss-typed and only just realised! It's actually now 0.2 seconds, so much better, thank you so much! – abic011 Mar 22 '21 at 11:58
1

@abic011 Great to hear that it solves everything. Just to make sure it is not so much `const` that does it but `&`. The latter forces objects to be passed by reference (only a memory address) rather than as a copy (the entire data, in your case 49 million entries) – Tom de Geus Mar 23 '21 at 09:01

xtensor's "operator/" slower than numpy's "/"

1 Answers1