
Given a vertex array: {v1, v2, v3, v4, v5, ..., vN}

And K polygons indexing it in blocks, e.g. for a sample 4-sided polygon: {v7, v2, v51, v16}

Note that two or more polygons may share the same vertex. In fact, most vertices will be shared by 4-6 polygons (valence 4 for quad meshes, valence 6 for triangle meshes).

... how might we efficiently reorder/sort the vertex data so as to mitigate cache misses when reading the vertices of a given polygon? I'm more interested in an algorithm that finishes in a reasonable time than in one that gives the best possible results. Even a rough heuristic is better than a completely arbitrary order here.

The ideal would be to turn something like {v1052, v507213, v63252, v3} into something more like {v70, v71, v72, v73}. That ideal will probably often be unachievable without some outliers, given how heavily the vertices are shared.

Full-blown solutions are certainly welcome, but I'm more interested in the name given to the family of algorithms concerned with this kind of runtime reorganization of user data for cache efficiency. I imagine such algorithms must exist, especially in the realm of efficient VBO representations of a mesh, but I have failed to come up with the proper search criteria. Would this still be called 'sorting'?

I can see two categories of ways to tackle this problem: one that actually traverses mesh neighbors, which would be very specific to the mesh representation, and another that simply looks at the general access pattern of one set of arrays indexing or pointing to entries in another. The latter appeals to me a bit more as a more generalized solution, even if it may not be as effective.
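
To illustrate what I mean by that second, more general category, here's roughly the shape of it (just a sketch of the idea; the helper name and signature are made up, not taken from any library): walk the index stream once, and the first time an element is touched, assign it the next free slot in a new array.

#include <vector>

// Sketch: reorder any data array by the order in which an index stream first
// touches its elements, remapping the indices in place as we go.
// (Hypothetical helper for illustration, not from any library.)
template <typename T>
void reorder_by_first_touch(std::vector<T>& data, std::vector<int>& indices)
{
    std::vector<int> remap(data.size(), -1); // old index -> new index
    std::vector<T> reordered;
    reordered.reserve(data.size());

    for (size_t i = 0; i < indices.size(); ++i)
    {
        const int old_index = indices[i];
        if (remap[old_index] == -1)
        {
            // First touch: assign the next free slot.
            remap[old_index] = static_cast<int>(reordered.size());
            reordered.push_back(data[old_index]);
        }
        indices[i] = remap[old_index];
    }

    // Keep any elements the index stream never referenced.
    for (size_t i = 0; i < data.size(); ++i)
        if (remap[i] == -1)
            reordered.push_back(data[i]);

    data.swap(reordered);
}

Applied to a mesh, 'data' would be the vertex array and 'indices' the flattened polygon vertex indices; that's essentially the linear-pass heuristic discussed in the comments below.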

Update:

As suggested, I attempted just coding some of the suggested heuristics instead of speculating too much about them algorithmically. It gave me a rough base to start with and helped me understand the nature of the problem better, but unfortunately with poor results. I tagged the question as both C and C++ since I was more interested in algorithmic research suggestions and possibly snippets in those languages than in providing my own implementation, but here's mine in C++ just to compare:

#define _SECURE_SCL 0
#include <iostream>
#include <vector>
#include <cmath>
#include <cstdlib>   // malloc/free/abs used below
#include <cassert>
#include <algorithm>
#include <deque>

using namespace std;

static const float pi_f = 3.14159265358979323846f;
enum {poly_size = 3};

class Alloc
{
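    // Simple growing pool allocator: hands out bump-allocated blocks from a
    // linked list of pools; used for the variable-length per-vertex polygon lists.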
public:
    Alloc(): head(0)
    {
    }

    ~Alloc()
    {
        purge();
    }

    void purge()
    {
        while (head)
        {
            Pool* next = head->next;
            free(head);
            head = next;
        }
    }

    void* allocate(int size)
    {
        size = (size + 15) & (~0x0f);
        if (head && (head->used + size) < head->capacity)
        {
            void* mem = head->mem + head->used;
            head->used += size;
            return mem;
        }
        int new_pool_size = 4096;
        if (size > new_pool_size)
            new_pool_size = size;
        Pool* new_pool = static_cast<Pool*>(malloc(sizeof(Pool) + new_pool_size));
        new_pool->used = size;
        new_pool->capacity = new_pool_size;
        new_pool->next = head;
        head = new_pool;
        return new_pool->mem;
    }

private:
    struct Pool
    {
        Pool* next;
        int used;
        int capacity;
        char mem[1];
    };
    Pool* head;
};

struct Vertex
{
    Vertex(): x(0), y(0), z(0) {}
    float x, y, z;
};

struct VertexPolys
{
    VertexPolys(): num_polys(0), polys(0) {}
    int num_polys;
    int* polys;
};

struct Poly
{
    Poly()
    {
        vertices[0] = -1;
        vertices[1] = -1;
        vertices[2] = -1;
    }
    int vertices[poly_size];
};

struct IndexSet
{
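    // Insertion-ordered set of vertex indices: 'data' records each index the
    // first time it is inserted, so it doubles as the new vertex order.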
    explicit IndexSet(int n): num(0), data(n), used(n, false) {}

    void insert(int index)
    {
        if (!used[index])
        {
            data[num++] = index;
            used[index] = true;
        }
    }

    int num;
    vector<int> data;
    vector<char> used;
};

struct Mesh
{
    void reorder_vertices(const vector<int>& new_order)
    {
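        // new_order maps new position -> old vertex index; polygon indices
        // are remapped through the inverse permutation to match.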
        assert(new_order.size() == vertices.size());
        vector<int> to_new_order(new_order.size());
        for (size_t i=0; i < new_order.size(); ++i)
            to_new_order[new_order[i]] = i;

        for (size_t i=0; i < polys.size(); ++i)
        {
            Poly& poly = polys[i];
            for (int j=0; j < poly_size; ++j)
                poly.vertices[j] = to_new_order[poly.vertices[j]];
        }
        vector<Vertex> reordered_vertices(vertices.size());
        for (size_t i=0; i < new_order.size(); ++i)
            reordered_vertices[i] = vertices[new_order[i]];
        vertices.swap(reordered_vertices);
    }

    vector<Vertex> vertices;
    vector<Poly> polys;
    vector<VertexPolys> vertex_polys;
};

static void create_sphere(Mesh* mesh, float radius, int rings, int sectors)
{
    const float ring_step = 1.0f / (float)(rings-1);
    const float side_step = 1.0f / (float)(sectors-1);
    const int total_verts = rings * sectors;

    // Create sphere vertices.
    vector<int> indices;
    indices.reserve(total_verts);
    for (int r=0; r < rings; ++r) 
    {
        for (int s=0; s < sectors; ++s) 
        {           
            indices.push_back(mesh->vertices.size());
            const float x = cos(2*pi_f * s * side_step) * sin(pi_f * r * ring_step);
            const float y = sin(-pi_f/2 + pi_f * r * ring_step);
            const float z = sin(2*pi_f * s * side_step) * sin(pi_f * r * ring_step);
            Vertex new_vertex;
            new_vertex.x = x * radius;
            new_vertex.y = y * radius;
            new_vertex.z = z * radius;
            mesh->vertices.push_back(new_vertex);
        }
    }

    // Create sphere triangles.
    for (int r=0; r < rings-1; ++r) 
    {
        for (int s=0; s < sectors-1; ++s) 
        {
            int npv[4] = 
            {
                r * sectors + s,
                r * sectors + (s+1),
                (r+1) * sectors + (s+1),
                (r+1) * sectors + s
            };
            for (int j = 0; j < 4; ++j)
                npv[j] = indices[npv[j] % total_verts];

            Poly new_poly1;
            new_poly1.vertices[0] = npv[0];
            new_poly1.vertices[1] = npv[1];
            new_poly1.vertices[2] = npv[2];
            mesh->polys.push_back(new_poly1);

            Poly new_poly2;
            new_poly2.vertices[0] = npv[2];
            new_poly2.vertices[1] = npv[3];
            new_poly2.vertices[2] = npv[0];
            mesh->polys.push_back(new_poly2);
        }
    }
}

static Mesh create_mesh(Alloc& alloc)
{
    Mesh mesh;
    create_sphere(&mesh, 10.0f, 100, 100);

    // Shuffle the vertex order to make it all random (this tends
    // to reflect a real-world model better which can get very arbitrary
    // in terms of its vertex/polygon creation order).
    vector<int> new_vertex_order(mesh.vertices.size());
    for (size_t j=0; j < mesh.vertices.size(); ++j)
        new_vertex_order[j] = static_cast<int>(j);
    random_shuffle(new_vertex_order.begin(), new_vertex_order.end());
    mesh.reorder_vertices(new_vertex_order);

    // Count the number of polygons connected to each vertex.
    mesh.vertex_polys.resize(mesh.vertices.size());
    for (size_t i=0; i < mesh.polys.size(); ++i)
    {
        const Poly& poly = mesh.polys[i];
        for (int j=0; j < poly_size; ++j)
        {
            const int vertex_index = poly.vertices[j];
            ++mesh.vertex_polys[vertex_index].num_polys;
        }
    }

    // Allocate space for the vertex polygons.
    for (size_t i=0; i < mesh.vertex_polys.size(); ++i)
    {
        VertexPolys& vp = mesh.vertex_polys[i];
        vp.polys = static_cast<int*>(alloc.allocate(vp.num_polys * sizeof(int)));
        vp.num_polys = 0;
    }

    // Form the polygons connected to each vertex.
    for (size_t i=0; i < mesh.polys.size(); ++i)
    {
        const Poly& poly = mesh.polys[i];
        for (int j=0; j < poly_size; ++j)
        {
            const int vertex_index = poly.vertices[j];
            VertexPolys& vp = mesh.vertex_polys[vertex_index];
            vp.polys[vp.num_polys++] = i;
        }
    }
    return mesh;
}

static void output_stats(const Mesh& mesh, const char* type)
{
    // Measure the proximity of each pair of vertices in each polygon.
    int num_optimal = 0;
    float prox_sum = 0.0f;
    for (size_t i=0; i < mesh.polys.size(); ++i)
    {
        const Poly& poly = mesh.polys[i];
        const int prox1 = abs(poly.vertices[1] - poly.vertices[0]);
        const int prox2 = abs(poly.vertices[2] - poly.vertices[1]);
        const int prox3 = abs(poly.vertices[0] - poly.vertices[2]);
        const float poly_prox = (prox1 + prox2 + prox3) / 3.0f;
        prox_sum += poly_prox;
        if (prox1 <= 6 && prox2 <= 6 && prox3 <= 6)
            ++num_optimal;
    }
    cout << "-------------------------------------------------" << endl;
    cout << type << endl;
    cout << "-------------------------------------------------" << endl;
    cout << "-- # Vertices: " << mesh.vertices.size() << endl;
    cout << "-- # Polygons: " << mesh.polys.size() << endl;
    cout << "-- # Optimal Polygons: " << num_optimal << endl;
    cout << "-- Average vertex proximity: " << (prox_sum / mesh.polys.size()) << endl;
}

static void basic_optimization(Mesh& mesh)
{
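    // Linear-pass heuristic: visit polygons in order and emit each vertex the
    // first time any polygon references it; that becomes the new vertex order.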
    IndexSet index_set(static_cast<int>(mesh.vertices.size()));
    for (size_t i=0; i < mesh.polys.size(); ++i)
    {
        const Poly& poly = mesh.polys[i];
        for (int j=0; j < poly_size; ++j)
        {
            const int vertex_index = poly.vertices[j];
            index_set.insert(vertex_index);
        }
    }
    mesh.reorder_vertices(index_set.data);
}

static void breadth_first_optimization(Mesh& mesh)
{
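    // Breadth-first heuristic: flood-fill across polygon adjacency starting at
    // polygon 0, emitting vertices in the order they are first encountered.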
    IndexSet index_set(static_cast<int>(mesh.vertices.size()));
    assert(!mesh.polys.empty());

    deque<int> to_process;
    vector<char> processed(mesh.polys.size(), false);
    to_process.push_back(0);
    processed[0] = true;

    while (!to_process.empty())
    {
        const int poly_index = to_process.front();
        to_process.pop_front();

        const Poly& poly = mesh.polys[poly_index];
        for (int i=0; i < poly_size; ++i)
        {
            const int vertex_index = poly.vertices[i];
            index_set.insert(vertex_index);
            const VertexPolys& vp = mesh.vertex_polys[vertex_index];
            for (int j=0; j < vp.num_polys; ++j)
            {
                if (!processed[vp.polys[j]])
                {
                    to_process.push_back(vp.polys[j]);
                    processed[vp.polys[j]] = true;
                }
            }
        }
    }
    mesh.reorder_vertices(index_set.data);
}

int main()
{
    // Linear Heuristic
    {
        Alloc alloc;
        Mesh mesh = create_mesh(alloc);
        basic_optimization(mesh);
        output_stats(mesh, "Linear Heuristic");
    }

    // Breadth-First Heuristic
    {
        Alloc alloc;
        Mesh mesh = create_mesh(alloc);
        breadth_first_optimization(mesh);
        output_stats(mesh, "Breadth-First Graph Heuristic");
    }
}

Results:

-------------------------------------------------
Linear Heuristic
-------------------------------------------------
-- # Vertices: 10000
-- # Polygons: 19602
-- # Optimal Polygons: 198
-- Average vertex proximity: 67.0118
-------------------------------------------------
Breadth-First Graph Heuristic
-------------------------------------------------
-- # Vertices: 10000
-- # Polygons: 19602
-- # Optimal Polygons: 13
-- Average vertex proximity: 88.9976

The nature of the problem seems to boil down to mapping this highly-connected 2D topological mesh grid into a 1-dimensional space with good locality. That's why I think these basic algorithms don't fare too well: the connectivity and seamlessness of the mesh make it impossible for a heuristic that traverses the mesh in its entirety to map it very effectively.

It's similar to texture atlas mapping, where we're trying to take the 3D mesh, 'unfold' it, and map it into a flat 2D space. To even solve that problem, we often need to introduce seams into the mesh so that we can break it down into 2D islands of connectivity. Here we're similarly taking a 2D graph space and breaking it down into islands with good locality in 1D memory space.

So I think an optimal algorithm would break the mesh down into segments this way and apply these kinds of algorithms to each segment. That would give good 1D memory locality within a given segment of the mesh, with the pathological cases being the outliers at the boundaries of each segment, where vertices are shared with other segments.

A variation of the breadth-first search could do this crudely if it only traversed to a breadth of K for a given centroid polygon, P, where K is some small number, before choosing a completely different P. That would isolate small connected 2D islands in a mesh that can be mapped to a 1-dimensional memory space with good locality within each island (just poor at the boundaries of those islands).
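
For the record, here's roughly what that variation might look like against the Mesh structures above (an untested sketch: max_island_polys is an arbitrary cap I haven't tuned, and for simplicity it just seeds each island from the next unprocessed polygon rather than picking random centroids; like the heuristics above, it assumes every vertex is referenced by at least one polygon):

static void island_bfs_optimization(Mesh& mesh, int max_island_polys)
{
    IndexSet index_set(static_cast<int>(mesh.vertices.size()));
    vector<char> processed(mesh.polys.size(), false);

    for (size_t seed = 0; seed < mesh.polys.size(); ++seed)
    {
        if (processed[seed])
            continue;

        // Grow a small island of at most max_island_polys polygons around
        // this seed, emitting vertices in the order the island touches them.
        deque<int> to_process;
        to_process.push_back(static_cast<int>(seed));
        processed[seed] = true;
        int island_polys = 1;

        while (!to_process.empty())
        {
            const int poly_index = to_process.front();
            to_process.pop_front();

            const Poly& poly = mesh.polys[poly_index];
            for (int i = 0; i < poly_size; ++i)
            {
                const int vertex_index = poly.vertices[i];
                index_set.insert(vertex_index);

                const VertexPolys& vp = mesh.vertex_polys[vertex_index];
                for (int j = 0; j < vp.num_polys; ++j)
                {
                    // Stop enqueuing neighbors once the island hits its cap;
                    // the remaining polygons will seed their own islands.
                    if (!processed[vp.polys[j]] && island_polys < max_island_polys)
                    {
                        to_process.push_back(vp.polys[j]);
                        processed[vp.polys[j]] = true;
                        ++island_polys;
                    }
                }
            }
        }
    }
    mesh.reorder_vertices(index_set.data);
}

It would drop into main in place of breadth_first_optimization with something like island_bfs_optimization(mesh, 64), though I haven't measured whether that actually improves the proximity stats.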

Anyway, I was mainly looking for formal algorithmic names for such a thing. I'm still not sure if this would be called 'sorting' -- perhaps I'm searching for the wrong thing.

  • An obvious "rough heuristic" would be to just run through your polygons once, placing their vertices together. This will even be optimal if no polygons share vertices. – Jim Balter May 05 '15 at 01:13
  • @JimBalter The sharing part was what got me. Every vertex is often shared by an average of 4-6 polygons (6 for a triangle mesh with an average vertex valence of 6). So I thought a single pass heuristic there might not fare so well with distant polygons in memory "fighting" each other for the vertices. –  May 05 '15 at 01:15
  • @JimBalter What do you think if we first did a pass trying to put polygons that have shared vertices next to each other (ex: with a spatial partitioning structure), then another on the vertices? I imagine some outliers are required -- pathological cases since the problem sounds NP-hardish, but if most polygons have their vertices fit the cache, that'd be great. –  May 05 '15 at 01:17
  • At least the polygons that get their vertices together won't be "fighting" themselves. Do less speculation and more measurement. – Jim Balter May 05 '15 at 01:27
  • Can you use Cuthill-McKee to reorder the vertices? I don't know if this would improve cache hits, but it would at least group nearby vertices physically closer in memory. – Carlton May 05 '15 at 01:28
  • @JimBalter Apologies, maybe I'm looking at it wrong but say you have a polygon, p1: {v1, v2, v3, v4} and p1000000: {v1, v783252, ...}. What do you do here if you are only sorting vertices with a linear pass heuristic? –  May 05 '15 at 01:31
  • Depending on whether you move vertices already placed, you get either v1, v2, v3, v4, v783252 or v2, v3, v4, v1, v783252 ... either the first polygon or the last polygon will have all of its vertices together. – Jim Balter May 05 '15 at 01:36
  • @JimBalter I see -- I will try that. I thought the amount of contention given 6 triangles to a vertex might really skew the results towards the beginning or the end. But I'll try to construct a test and some kind of measurement of its quality -- we can see from there. I'm looking at large-scale cases of meshes that often span in the tens of millions of vertices so I might have let that speculation go a bit too wild. –  May 05 '15 at 01:38
  • @Carlton I was thinking something along those lines. I have a segmentation algorithm already in place, used elsewhere, to break a mesh down into N segments. Even though the segments can share vertices at their boundaries, we can 'pretend' that they don't, perhaps applying Jim's method to each segment, and that seems like it would allow better results with the outlier vertices being the pathological cases. The segmentation would force the mesh to be partitioned into sections where its vertices and polygons are close together and not too numerous (thousands, not millions). –  May 05 '15 at 01:44
  • Carlton suggested Cuthill-McKee, which is a form of breadth-first search ... something along those lines would get better locality than just a linear pass. e.g., place the vertices of some polygon, then for each vertex, if any, shared with another polygon, place the vertices of that polygon, etc., until you've handled all the polygons. – Jim Balter May 05 '15 at 01:44
  • The breadth-first search sounds quite nice -- I think I was tripping because I was visualizing the vertex pass heuristic as having a similar effect to the depth-first one where we can find that we've gone very deep and come back to a polygon which shares a vertex at the bottom of the stack (if we imagine a recursive implementation for simplicity). That said, I'll stop theorizing now and get to the code experimentation! –  May 05 '15 at 01:54
  • I ended up trying both ways -- the linear pass and breadth-first search and the results were a bit surprising. I used a basic 'average goodness' metric which simply measures the proximity of the polygon point indices with a polygon considered fairly 'optimal' if no two points are more than 4 indices apart. The interesting thing is that the linear pass method actually fared better than the breadth-first search, but both give similar results. Linear pass: makes 2% polygons 'optimal', and for a 10k mesh, has an average vertex proximity of 67 (on average, two poly vertices are 67 indices apart). –  May 05 '15 at 13:35
  • The breadth-first search makes about 0.1% of the polygons 'optimal', and for a 10k mesh, has an average vertex proximity of 88. Both unfortunately leave the polygon vertices still widely dispersed across the mesh. I really think this is an NP-hard problem because we're trying to map this shared mesh topology which is like a 2D structure into a 1-dimensional space, tightly packed into local regions. So I see it somewhat like texture atlas mapping a mesh -- taking it from 3-dimensions to 2, and works best if we have 'seam areas' to slice that data to unwrap and flatten it. –  May 05 '15 at 13:37
  • A variation of the breadth-first search might work if we take a given centroid polygon, P, and don't traverse it beyond a breadth of K (where K is some small number). That would map P and its nearby neighbors into a 1-dimensional memory space with good locality. Then we stop and choose another P (perhaps simply a random one for a quick heuristic). –  May 05 '15 at 13:43
  • Sometimes data denormalization is the way to go, certainly for static objects, i.e. where the potentially shared vertices don't change. The "waste" of memory is only a waste if you're memory-constrained. Otherwise, careful denormalization can yield purely streaming behavior where you're accessing sequential data, at least at the input to some processing stage. – Kuba hasn't forgotten Monica Nov 13 '20 at 15:40

1 Answer


I feel a little goofy answering my own question and accepting it as an answer, and appreciate all the comments provided!

But I found a class of algorithms and papers dealing with this very problem, starting with Tom Forsyth's Linear-Speed Vertex Cache Optimization as an often-cited heuristic. Curiously, they often tackle these problems by actually simulating the hardware cache. Here is one such example: http://www-etud.iro.umontreal.ca/~blancher/projects/vertex_caching/vertex_caching_eb.pdf
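
To give a flavor of the cache-simulation part (this is just my own minimal sketch of the scoring side, not Forsyth's actual algorithm): the papers typically replay a candidate triangle index order through a small simulated FIFO vertex cache and count misses, giving the ACMR (average cache misses per triangle) that the heuristics try to minimize. The cache size here is a made-up parameter; real post-transform caches vary.

#include <vector>
#include <deque>
#include <algorithm>

// Score an index ordering by replaying it through a simulated FIFO vertex
// cache and counting misses. ACMR = misses / triangle count. Sketch only;
// real GPU caches differ in size and replacement policy.
static float simulate_acmr(const std::vector<int>& indices, size_t cache_size)
{
    std::deque<int> cache; // front() is the oldest entry
    size_t misses = 0;

    for (size_t i = 0; i < indices.size(); ++i)
    {
        const int v = indices[i];
        if (std::find(cache.begin(), cache.end(), v) == cache.end())
        {
            ++misses;
            cache.push_back(v);
            if (cache.size() > cache_size)
                cache.pop_front();
        }
    }
    const size_t triangles = indices.size() / 3;
    return triangles ? static_cast<float>(misses) / triangles : 0.0f;
}

A lower ACMR for the reordered index stream is what the Forsyth-style heuristics chase; presumably the same scoring idea could be adapted to measure vertex-fetch locality for the reordering asked about in the question.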

The problem is apparently NP-complete but there seem to be a number of practical heuristics available.

Such techniques can apparently improve rendering speeds by quite a large margin, and should be applicable to general mesh traversal algorithms (with or without the involvement of the GPU and inside or outside a rendering context).

I ended up working around the problem rather than solving it by segmenting meshes into smaller connected regions (improving locality of reference with small regions of tightly-packed poly/vert/edge data, rather than improving the order of the indexing within one larger region). But I'd like to try these techniques out in the near future, since I had to jump through many hurdles to get all those little segmented regions to stitch up seamlessly for the user, and it would be much easier to just have one big mesh with a more cache-friendly memory layout when accessing, e.g., polygon vertices, vertex edges, or edge polygons.

Since there weren't any answers yet, I thought I'd offer my discoveries for anyone curious.