1

I have a performance critical piece of code, where I need to check one array for values below a threshold and then conditionally set the values of two other arrays. My code looks like this:

#include <Eigen/Dense>

int main(){
    Eigen::ArrayXXd
        a (1, 100),
        b (2, 100),
        c (3, 100);
    
    a.setRandom();
    b.setRandom();
    c.setRandom();
    
    constexpr double minVal { 1e-8 };
    
    /* the code segment in question */
    /* option 1 */
    for ( int i=0; i<2; ++i ){
        b.row(i)   = (a < minVal).select( 0, c.row(i+1) / a );
        c.row(i+1) = (a < minVal).select( 0, c.row(i+1) );
    }
    /* option 2, which is slower */
    b = (a < minVal).replicate(2,1).select( 0, c.bottomRows(2) / a.replicate(2,1) );
    c.bottomRows(2) = (a < minVal).replicate(2,1).select( 0, c.bottomRows(2) );

    return 0;
}

The array a, whose values are checked for reaching the threshold minVal, has one row and a dynamic number of columns. The other two arrays b and c have two and three rows, respectively, and the same number of columns as a.

Now I would like to do the above logic in a more eigen way, without that loop in option 1, because typically, eigen has tricks up its sleeve for performance, that I can never hope to match when writing raw loops. However, the only way I could think of was option 2, which is noticeably slower than option 1.

What would be the right and efficient way to do the above? Or is the loop already my best option?

RL-S
  • 734
  • 6
  • 21

1 Answers1

1

You can try the following:

  • Define your array types with fixed number of rows and dynamic number of columns, i.e., you can replace Eigen::ArrayXXd with Eigen::Array<double, 1/2/3, Eigen::Dynamic>.
  • Use fixed-size version of block operations (see https://eigen.tuxfamily.org/dox/group__TutorialBlockOperations.html), i.e., you can replace bottomRows(N) with bottomRows<N>() and similarly replicate(2,1) with replicate<2,1>().

I have changed the array types in your code and included a third option with the possible improvements that I have mentioned:

#include <Eigen/Dense>

#include <iostream>
#include <chrono>

constexpr int numberOfTrials = 1000000;
constexpr double minVal{ 1e-8 };

typedef Eigen::Array<double, 1, Eigen::Dynamic> Array1Xd;
typedef Eigen::Array<double, 2, Eigen::Dynamic> Array2Xd;
typedef Eigen::Array<double, 3, Eigen::Dynamic> Array3Xd;

inline void option1(const Array1Xd& a, Array2Xd& b, Array3Xd& c)
{
    for (int i = 0; i < 2; ++i) {
        b.row(i) = (a < minVal).select(0, c.row(i + 1) / a);
        c.row(i + 1) = (a < minVal).select(0, c.row(i + 1));
    }
}

inline void option2(const Array1Xd& a, Array2Xd& b, Array3Xd& c)
{
    b = (a < minVal).replicate(2, 1).select(0, c.bottomRows(2) / a.replicate(2, 1));
    c.bottomRows(2) = (a < minVal).replicate(2, 1).select(0, c.bottomRows(2));
}

inline void option3(const Array1Xd& a, Array2Xd& b, Array3Xd& c)
{
    b = (a < minVal).replicate<2, 1>().select(0, c.bottomRows<2>() / a.replicate<2, 1>());
    c.bottomRows<2>() = (a < minVal).replicate<2, 1>().select(0, c.bottomRows<2>());
}

int main() {
    Array1Xd a(1, 100);
    Array2Xd b(2, 100);
    Array3Xd c(3, 100);

    a.setRandom();
    b.setRandom();
    c.setRandom();

    auto tpBegin1 = std::chrono::steady_clock::now();
    for (int i = 0; i < numberOfTrials; i++)
        option1(a, b, c);
    auto tpEnd1 = std::chrono::steady_clock::now();

    auto tpBegin2 = std::chrono::steady_clock::now();
    for (int i = 0; i < numberOfTrials; i++)
        option2(a, b, c);
    auto tpEnd2 = std::chrono::steady_clock::now();

    auto tpBegin3 = std::chrono::steady_clock::now();
    for (int i = 0; i < numberOfTrials; i++)
        option3(a, b, c);
    auto tpEnd3 = std::chrono::steady_clock::now();

    std::cout << "(Option 1) Average execution time: " << std::chrono::duration_cast<std::chrono::microseconds>(tpEnd1 - tpBegin1).count() / (long double)(numberOfTrials) << " us" << std::endl;
    std::cout << "(Option 2) Average execution time: " << std::chrono::duration_cast<std::chrono::microseconds>(tpEnd2 - tpBegin2).count() / (long double)(numberOfTrials) << " us" << std::endl;
    std::cout << "(Option 3) Average execution time: " << std::chrono::duration_cast<std::chrono::microseconds>(tpEnd3 - tpBegin3).count() / (long double)(numberOfTrials) << " us" << std::endl;

    return 0;
}

Average execution times that I have obtained are as follows (i7-9700K, msvc2019, optimizations enabled, NDEBUG):

(Option 1) Average execution time: 0.527717 us
(Option 2) Average execution time: 3.25618 us
(Option 3) Average execution time: 0.512029 us

And with AVX2+OpenMP enabled:

(Option 1) Average execution time: 0.374309 us
(Option 2) Average execution time: 3.31356 us
(Option 3) Average execution time: 0.260551 us

I'm not sure if it is the most "Eigen" way but I hope it helps!

puhu
  • 176
  • 1
  • 8
  • I can't use fixed row arrays, because otherwise I can't use a container for them. But I'll definitely try fixed block sizes and have a look whether that fixes it! – RL-S Aug 16 '20 at 14:03
  • 1
    I have only replaced the array types with ArrayXXd and kept using fixed-size operations. Obtained execution times are: `(Option 1) 0.50138 us, (Option 2) 6.46162 us, (Option 3) 2.1247 us`. Even though there is still an improvement, it seems that fixed-size matrices were mainly responsible for the previous performance. If you are able to share I want to ask: what do you mean by using a container? Are you using another library and interchanging array types for different operations? – puhu Aug 16 '20 at 14:50
  • The ```ArrayXXd```s are data members of a class, which is then stored in a container (std::array) that I'm looping over. So I'd have to template the class, which results in different types for different numbers of rows. Then I can't store them in a container anymore (except with std::variant, but that has quite the overhead, too). However, there are not terribly many of them, so I'll keep that as an option. But given Option 1 works either way, I'll go that route first. Thanks a lot for your effort! – RL-S Aug 17 '20 at 09:03
  • So far, I thought that changing just **one** size parameter to fixed-size wouldn't have such a performance impact, because if the other one is dynamic, then the array still has to be allocated on the heap. Any idea, why it could make such a difference? – RL-S Aug 17 '20 at 09:19
  • For anyone trying this at home: When using ```auto``` to create an eigen expression, and then using a templated member function, you may run into a really weird ```error: expected unqualified-id before numeric constant```. Then you need to add ```template``` before the function call, e.g. ```(myEigenXpr * otherEigenXpr).template replicate<2,1>()``` Credits to this thread: https://stackoverflow.com/questions/4942703/why-do-i-get-an-error-trying-to-call-a-template-member-function-with-an-explicit – RL-S Aug 17 '20 at 10:30