I am trying to understand data-oriented design on a simple, specific problem. Apologies in advance to data-oriented design people, if I am doing something very stupid, but I am having a hard time understanding why and where my reasoning fails.
Assume that I have a simple operation, i.e., float_t result = int_t(lhs) / int_t(rhs)
. If I keep all of the variables in their corresponding containers, e.g., std::vector<float_t>
and std::vector<int_t>
, and I use std::transform
, I get the correct result. Then, for a specific example where using float_t = float
and using int_t = int16_t
, I assume that packing these variables inside a struct
, on a 64-bit architecture, and collecting them inside a container should yield better performance.
I reason that the struct
makes up a 64-bit object, and a single memory access to the struct
will give me all the variables I need. On the other hand, when all these variables are collected in different containers, I will need three different memory accesses to get the information needed. Below is how I set up the environment:
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>
using namespace std::chrono;
template <class float_t, class int_t> struct Packed {
float_t sinvl;
int_t s, l;
Packed() = default;
Packed(float_t sinvl, int_t s, int_t l) : sinvl{sinvl}, s{s}, l{l} {}
void comp() { sinvl = float_t(l) / s; }
};
using my_float = float;
using my_int = int16_t;
int main(int argc, char *argv[]) {
constexpr uint32_t M{100};
for (auto N : {1000, 10000, 100000}) {
double t1{0}, t2{0};
for (uint32_t m = 0; m < M; m++) {
std::vector<my_float> sinvl(N, 0.0);
std::vector<my_int> s(N, 3), l(N, 2);
std::vector<Packed<my_float, my_int>> p1(
N, Packed<my_float, my_int>(0.0, 3, 2));
// benchmark unpacked
auto tstart = high_resolution_clock::now();
std::transform(l.cbegin(), l.cend(), s.cbegin(), sinvl.begin(),
std::divides<my_float>{}); // 3 different memory accesses
auto tend = high_resolution_clock::now();
t1 += duration_cast<microseconds>(tend - tstart).count();
if (m == M - 1)
std::cout << "sinvl[0]: " << sinvl[0] << '\n';
// benchmark packed
tstart = high_resolution_clock::now();
for (auto &elem : p1) // 1 memory access
elem.comp();
tend = high_resolution_clock::now();
t2 += duration_cast<microseconds>(tend - tstart).count();
if (m == M - 1)
std::cout << "p1[0].sinvl: " << p1[0].sinvl << '\n';
}
std::cout << "N = " << N << ", unpacked: " << (t1 / M) << " us.\n";
std::cout << "N = " << N << ", packed: " << (t2 / M) << " us.\n";
}
return 0;
}
The compiled code with g++ -O3
yields, on my machine,
sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 1000, unpacked: 0 us.
N = 1000, packed: 1 us.
sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 10000, unpacked: 5.06 us.
N = 10000, packed: 12.97 us.
sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 100000, unpacked: 52.31 us.
N = 100000, packed: 124.49 us.
Basically, std::transform
beats the packed access by 2.5x
. I would appreciate if you helped me understand the behaviour. Is the result due to
- me not grasping the data-oriented design principles correctly, or,
- some artefact of this very simple example such as the memory locations getting allocated very close to each other and in some way getting optimized very efficiently by the compiler?
Finally, is there a way to beat std::transform
in this example, or, is it simply good enough to be a go-to solution? I am not an expert neither in compiler optimizations nor in data-oriented design, and thus, I could not answer this question myself.
Thanks!
EDIT. I have changed the way I test both of the methods as per @bolov's suggestion in the comments.
Now the code looks like:
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>
using namespace std::chrono;
template <class float_t, class int_t> struct Packed {
float_t sinvl;
int_t s, l;
Packed() = default;
Packed(float_t sinvl, int_t s, int_t l) : sinvl{sinvl}, s{s}, l{l} {}
void comp() { sinvl = float_t(l) / s; }
};
using my_float = float;
using my_int = int16_t;
int main(int argc, char *argv[]) {
uint32_t N{1000};
double t{0};
if (argc == 2)
N = std::stoul(argv[1]);
#ifndef _M_PACKED
std::vector<my_float> sinvl(N, 0.0);
std::vector<my_int> s(N, 3), l(N, 2);
// benchmark unpacked
auto tstart = high_resolution_clock::now();
std::transform(l.cbegin(), l.cend(), s.cbegin(), sinvl.begin(),
std::divides<my_float>{}); // 3 different memory accesses
auto tend = high_resolution_clock::now();
t += duration_cast<microseconds>(tend - tstart).count();
std::cout << "sinvl[0]: " << sinvl[0] << '\n';
std::cout << "N = " << N << ", unpacked: " << t << " us.\n";
#else
std::vector<Packed<my_float, my_int>> p1(N,
Packed<my_float, my_int>(0.0, 3, 2));
// benchmark packed
auto tstart = high_resolution_clock::now();
for (auto &elem : p1) // 1 memory access
elem.comp();
auto tend = high_resolution_clock::now();
t += duration_cast<microseconds>(tend - tstart).count();
std::cout << "p1[0].sinvl: " << p1[0].sinvl << '\n';
std::cout << "N = " << N << ", packed: " << t << " us.\n";
#endif
return 0;
}
with the corresponding shell (fish) script
g++ -Wall -std=c++11 -O3 transform.cpp -o transform_unpacked.out
g++ -Wall -std=c++11 -O3 transform.cpp -o transform_packed.out -D_M_PACKED
for N in 1000 10000 100000
echo "Testing unpacked for N = $N"
./transform_unpacked.out $N
./transform_unpacked.out $N
./transform_unpacked.out $N
echo "Testing packed for N = $N"
./transform_packed.out $N
./transform_packed.out $N
./transform_packed.out $N
end
which gives the following:
Testing unpacked for N = 1000
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
Testing packed for N = 1000
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
Testing unpacked for N = 10000
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
Testing packed for N = 10000
p1[0].sinvl: 0.666667
N = 10000, packed: 17 us.
p1[0].sinvl: 0.666667
N = 10000, packed: 13 us.
p1[0].sinvl: 0.666667
N = 10000, packed: 13 us.
Testing unpacked for N = 100000
sinvl[0]: 0.666667
N = 100000, unpacked: 64 us.
sinvl[0]: 0.666667
N = 100000, unpacked: 66 us.
sinvl[0]: 0.666667
N = 100000, unpacked: 66 us.
Testing packed for N = 100000
p1[0].sinvl: 0.666667
N = 100000, packed: 180 us.
p1[0].sinvl: 0.666667
N = 100000, packed: 198 us.
p1[0].sinvl: 0.666667
N = 100000, packed: 177 us.
I hope I have understood the proper testing method, correctly. Still, though, the difference is 2-3 folds.