I am creating a class that needs to store several array-like data sets. The arrays can change size, but all of the arrays inside the class will always have the same size. They will later be used for number crunching in methods provided by the class.

What is the best/standard way of declaring that kind of data inside the class?

Solution 1 – Raw arrays

class Example {
    double *Array_1;
    double *Array_2;
    double *Array_3;
    int size; //used to store size of all arrays
};

Solution 2 – std::vector for each array

class Example {
    vector<double> Array_1;
    vector<double> Array_2;
    vector<double> Array_3;
};

Solution 3 – A struct that stores each vertex, and a std::vector of that struct

struct Vertex{
    double Var_1;
    double Var_2;
    double Var_3;
};
class Example {
    vector<Vertex> data;
};

My conclusion as a beginner would be:

Solution 1 would have the best performance but would be the hardest to implement.

Solution 3 would be elegant and easier to implement, but I would run into problems when performing some calculations because the information would not be in an array format. This means regular numeric functions that take arrays/vectors would not work directly (I would need to create temporary vectors in order to do the number crunching).

Solution 2 might be the middle ground.

Any ideas for a 4th solution would be greatly appreciated.

  • 1 won't have any better performance than 2. When compiling with optimizations, `vector` pretty much gets optimized away. – NathanOliver Jul 31 '19 at 14:46
  • I think the rule of thumb is to stick with std containers: std::vector or std::array (fixed size). If you need pointers, try not to expose them; use smart pointers, std::unique_ptr or std::shared_ptr (the latter only as a last resort, as it carries significant overhead). – caiomcg Jul 31 '19 at 14:47
  • The fourth solution is to use `std::unique_ptr`. And it's the best if you need a replacement for the first one and nothing more. – Sopel Jul 31 '19 at 14:54
  • The general recommendation is that closely related data should be structured as a single unit. Like your `Vertex` structure. And using a vector (or array) of vertex structures is very common and often used in "mathematical calculations" using parallelism or CUDA kernels or the like. – Some programmer dude Jul 31 '19 at 14:55
  • `std::array` or `std::vector`. *Not* raw arrays. – Jesper Juhl Jul 31 '19 at 15:11
  • AoS *vs.* SoA is something of a field of research all of its own. Conceptually, parallel arrays are plainly inferior, but that doesn’t always carry the day. – Davis Herring Jul 31 '19 at 15:14
  • Note: names like `Array_1` in your `struct Vertex` are misleading - these are not arrays and not meant to be arrays. – anatolyg Jul 31 '19 at 15:17
  • @JesperJuhl: `array` is obviously wrong here; this is between `unique_ptr` and `vector`, and is really the long-rejected `dynarray` (plus the `struct` business). – Davis Herring Jul 31 '19 at 15:17
  • Most probably you will access the three coordinates `x`, `y`, `z` of one vertex in the same bit of code, very close together in time. The third sample, the array of vertices, is therefore better for CPU caches and prefetching, since the three coordinates of each vertex are placed in a small memory region. – 273K Jul 31 '19 at 15:35

3 Answers

Don't use raw arrays. Options 2 and 3 are reasonable, the difference depends on how you'll be traversing the data. If you'll frequently be going through the arrays individually, you should store them as in solution #2 because each vector will be stored contiguously in memory. If you'll be going through them as sets of points, then solution 3 is probably better. If you want to go with solution #2 and it's critical that the arrays always be synchronized (same size, etc.) then I would make them private and control access to them through member functions. Example:

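#include <cstddef>    // size_t
#include <stdexcept>  // std::runtime_error
#include <vector>

using std::vector;

class Example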
{
private:
    vector<double> Array_1;
    vector<double> Array_2;
    vector<double> Array_3;

public:
    void Push_data(double val1, double val2, double val3) {
        Array_1.push_back(val1);
        Array_2.push_back(val2);
        Array_3.push_back(val3);
    }

    vector<double> Get_all_points_at_index(size_t index) const {
        if (index < Array_1.size())
            return {Array_1[index], Array_2[index], Array_3[index]};
        else
            throw std::runtime_error("Error: index out of bounds");
    }

    const vector<double>& Get_array1() const {
        return Array_1;
    }

    void Clear_all() {
        Array_1.clear();
        Array_2.clear();
        Array_3.clear();
    }
};

This way, users of the class aren't burdened with the responsibility of making sure they add/remove values from all the vectors evenly. You do that in your class's member functions, where you have complete control over the underlying data. The accessor functions should be written such that it's impossible for a user (including you) to desynchronize the data.
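As a variation on the accessor above, here is a minimal sketch that returns a fixed-size `std::array` instead of a `vector`, so the return type itself says there are exactly three values and no heap allocation happens per call. The class name `SyncedArrays` and its member names are illustrative assumptions, not part of the original code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical variant of the Example class above.
class SyncedArrays {
    std::vector<double> a1_, a2_, a3_;

public:
    // The only way to add data: all three vectors grow together.
    void push(double v1, double v2, double v3) {
        a1_.push_back(v1);
        a2_.push_back(v2);
        a3_.push_back(v3);
        // Invariant check (active in debug builds): sizes stay in sync.
        assert(a1_.size() == a2_.size() && a2_.size() == a3_.size());
    }

    std::size_t size() const { return a1_.size(); }

    // Fixed-size return type: three values, no dynamic allocation.
    std::array<double, 3> at(std::size_t i) const {
        if (i >= a1_.size())
            throw std::out_of_range("SyncedArrays::at: index out of bounds");
        return {a1_[i], a2_[i], a3_[i]};
    }
};
```

Whether the allocation saved per call matters depends on how often the accessor runs; in a tight loop over many points it probably does.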

Carlton
  • One recommendation: `Get_all_points_at_index` should probably return `std::array`, or a special `Vertex` type, just to communicate the size guarantee (not to mention avoiding a performance hit from repeated dynamic allocation). – hegel5000 Jul 31 '19 at 15:54

If you are going to process big amounts of data, then solutions 1 and 2 are pretty much the same - the only meaningful difference is that solution 1 is hard to protect against memory leaks (while solution 2 deallocates your data when needed automatically).

The difference between solutions 2 and 3 is what people often call "Structure of arrays" vs "Array of structures". The runtime efficiency of these solutions depends on what your code does with them. The general principle is locality of reference. If your code frequently does number crunching only on the first component of your vertex data, then use structure of arrays (solution 2). However, any complex code will work on all of the data, so I guess solution 3 (array of structures) is the best.
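To make the two layouts concrete, here is a small sketch of each together with the kind of loop it favors. The component names `x`, `y`, `z` and the two functions are assumptions chosen for illustration; the question itself uses `Var_1` to `Var_3`.

```cpp
#include <vector>

// Structure of arrays (solution 2): each component is contiguous in memory.
struct VerticesSoA {
    std::vector<double> x, y, z;
};

// Array of structures (solution 3): each vertex is contiguous in memory.
struct Vertex {
    double x, y, z;
};

// Touches only x. With SoA, every cache line loaded holds only x values.
double sum_x(const VerticesSoA& v) {
    double s = 0.0;
    for (double xi : v.x) s += xi;
    return s;
}

// Touches all three components of each vertex. With AoS they sit side by side.
double sum_squared_norms(const std::vector<Vertex>& v) {
    double s = 0.0;
    for (const Vertex& p : v) s += p.x * p.x + p.y * p.y + p.z * p.z;
    return s;
}
```

Which layout is actually faster for a given workload is best settled by measuring it.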

Note that this example is rather idealized. If your data contains elements that are sometimes used in number crunching and sometimes not (e.g. the code does some transformation on two coordinates of the vertices while leaving the third untouched), then you might need to implement some kind of in-between solution: copy only the needed data to some place, transform it, and copy the results back.
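One way such an in-between solution might look, assuming an array-of-structures layout and a numeric routine that only accepts a flat array. `scale_in_place` is a hypothetical stand-in for that routine, and `scale_xy` is an illustrative name.

```cpp
#include <cstddef>
#include <vector>

struct Vertex {
    double x, y, z;
};

// Hypothetical stand-in for a numeric routine that expects a plain array.
void scale_in_place(std::vector<double>& values, double factor) {
    for (double& v : values) v *= factor;
}

// Gather x and y into a temporary buffer, transform, scatter back.
// z is never copied and never modified.
void scale_xy(std::vector<Vertex>& verts, double factor) {
    std::vector<double> tmp;
    tmp.reserve(verts.size() * 2);
    for (const Vertex& p : verts) {
        tmp.push_back(p.x);
        tmp.push_back(p.y);
    }

    scale_in_place(tmp, factor);

    for (std::size_t i = 0; i < verts.size(); ++i) {
        verts[i].x = tmp[2 * i];
        verts[i].y = tmp[2 * i + 1];
    }
}
```

The two extra copies cost time and memory, so this only pays off when the transformation itself is expensive relative to the copying.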

anatolyg

Forget about approach 1 (as the others have mentioned) and stick to whichever of approaches 2 and 3 best fits your needs. To me, your code looks like part of an application/library that manages coordinates/data in a 3D space. So you should think about which operations you need to perform on these 3D coordinates/data and which approach makes your code simpler or more efficient. As an example, if at some point you need to pass the raw data of one dimension to a third-party library (e.g. for visualization), you should go for approach 2.
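A sketch of what handing one dimension to an external library could look like with approach 2. `third_party_max` is a hypothetical stand-in for a C-style library function, and the accessor names are assumptions. The key fact is that `std::vector` storage is contiguous, so `data()` yields a pointer usable as a plain array without copying.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a third-party function taking a raw array.
double third_party_max(const double* values, std::size_t count) {
    double best = count ? values[0] : 0.0;
    for (std::size_t i = 1; i < count; ++i)
        if (values[i] > best) best = values[i];
    return best;
}

class Example {
    std::vector<double> array_1_, array_2_, array_3_;

public:
    // All three vectors grow together, so one size describes them all.
    void push(double v1, double v2, double v3) {
        array_1_.push_back(v1);
        array_2_.push_back(v2);
        array_3_.push_back(v3);
    }

    std::size_t size() const { return array_1_.size(); }

    // Read-only pointer to the first dimension; valid until the next push.
    const double* array1_data() const { return array_1_.data(); }
};
```

With approach 3 the same call would first need the first component copied out of every vertex into a temporary array.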

As a concrete example, VTK (the Visualization Toolkit) has lots of data structures that keep 3D data in both ways, either like your 2nd approach (see vtkTypedDataArray) or like your 3rd approach (see vtkAOSDataArrayTemplate). Taking a look at them may give you some ideas.

TonySalimi