134

Is there any possible optimization for random access on a very big array (I currently use uint8_t, and I'm asking about what's better)

uint8_t MyArray[10000000];

when the value at any position in the array is

  • 0 or 1 for 95% of all cases,
  • 2 in 4% of cases,
  • between 3 and 255 in the other 1% of cases?

So, is there anything better than a uint8_t array to use for this? Looping over the whole array in a random order should be as quick as possible, and that is very heavy on RAM bandwidth: once more than a few threads do it at the same time on different arrays, the whole RAM bandwidth is quickly saturated.

I'm asking because it feels very inefficient to have such a big array (10 MB) when it's known that almost all values, apart from 5%, will be either 0 or 1. Since 95% of all values in the array would actually need only 1 bit instead of 8 bits, this could reduce memory usage by almost an order of magnitude. It feels like there has to be a more memory-efficient solution that would greatly reduce the RAM bandwidth required, and as a result also be significantly quicker for random access.

JohnAl
  • 3
    It depends. Compression/decompression takes CPU, so please make your question more specific (include your actual code) so we can know which optimization is worth applying. Otherwise answers would be pure guesses. – user202729 May 14 '18 at 05:26
  • 36
    Two bits (0 / 1 / see hashtable) and a hashtable for the values bigger than 1? – user253751 May 14 '18 at 05:30
  • 6
    @user202729 On what does it depend? I think this is something that's an interesting question for anyone who has to do something similar like I do, so I would like to see more of a universal solution for this, not an answer that's super specific to my code. If it depends on something, it would be good to have an answer explaining what it depends on so that everyone reading it can understand if there is a better solution for his own case. – JohnAl May 14 '18 at 05:44
  • @immibis I did also think about that, but the problem basically is, since memory access in RAM always happens aligned (I think?), is there actually any benefit to accessing an array of 2 bits in a random order? – JohnAl May 14 '18 at 05:45
  • 7
    Essentially, what you're asking about is called [*sparsity*](https://en.wikipedia.org/wiki/Sparse_matrix). – Mateen Ulhaq May 14 '18 at 06:01
  • 1
    Instead of full-blown compression, you could come up with an encoding that can be quickly decoded. For instance, "68-0, 1-1, 105-0, 1-2, ..." – Mateen Ulhaq May 14 '18 at 06:02
  • Make it a computed function instead of an array. – user207421 May 14 '18 at 06:40
  • 5
    Needs more information... Why is the access random, and do the non-zero values follow a pattern? – Ext3h May 14 '18 at 06:51
  • @Ext3h The values >2 follow no pattern, and why the access is random is hard to describe. – JohnAl May 14 '18 at 06:55
  • 2
    Various related questions: [Stackoverflow search for lookup sparse array](https://stackoverflow.com/search?q=lookup+sparse+array) – JollyJoker May 14 '18 at 10:24
  • As an aside, I'm mildly surprised you don't get a stack overflow when declaring such a large array as a local variable. –  May 14 '18 at 15:16
  • Important preliminary question -- what performance do you get if you switch to an array where all of the values are *actually* 0 or 1, and use your favorite method for that? This preliminary test should help give a suggestion of the best performance you can possibly hope to attain... which might be "you can't do better". –  May 14 '18 at 15:21
  • 3
    @JohnAl What do you mean by "to loop over the whole array in a random order"? _Random access_ usually means that you access one arbitrary position, or a few positions in arbitrary order. Do you really mean something like _random shuffle_? – Pablo H May 14 '18 at 17:38
  • 2
    Is the array read-only or often modified? Can you afford a precomputation step? If it's read-only and one can do precomputation on it, you may be able to trade computation for memory and eliminate the LUT entirely using techniques related to perfect hashing, or get a much smaller LUT with a hybrid of hashing and lookups. – Iwillnotexist Idonotexist May 14 '18 at 19:56
  • 1
    @JohnAl If you really want to loop over this in random order (as opposed to random access), you can be very efficient b/c the order doesn't matter. Use an array of length 256 where the value in position i is the number of i's. Select randomly from this giving appropriate weight to each cell, and decrement the selected cell. – Dave May 14 '18 at 20:44
  • @JohnAl "The values >2 follow no pattern", so you have a fixed pattern for 99% of the cases? Can you tell whether the value is 0,1,2 or other just by examining the index? – JollyJoker May 15 '18 at 07:53
  • @JollyJoker no, the values <2 also follow no pattern. – JohnAl May 15 '18 at 08:04
  • @PabloH I mean that I access all array positions in a random order – JohnAl May 15 '18 at 08:11
  • 4
    @IwillnotexistIdonotexist A precomputation step would be fine, but the array should still be modified from time to time, so the precomputation step shouldn't be too expensive. – JohnAl May 15 '18 at 08:12
  • "It should be as quick as possible" and "it feels very inefficient" - it seems to me that you *need* the former but only wonder about whether you *should* fix the latter. If that's the case, stick with you 10M array. – paxdiablo May 16 '18 at 04:53
  • Note also that cache locality may be relevant if speed is a bottleneck. – Thorbjørn Ravn Andersen May 16 '18 at 14:17
  • 2
    Any threshold on what percent of accesses are writes? per @IwillnotexistIdonotexist's comment. Are writes contiguous, random-access, bursty? Is the address of a write event related to the preceding read event? (e.g. constraint on locality) Anything you can statistically tell us about that helps optimize the data structure. – smci May 17 '18 at 02:32
  • *"Values both <2 and >2 follow no pattern... I access all array positions in a random order"*. This does not make any sense and does not seem like a real question (unless it's cache-busting code). What purpose is your code serving, other than to be statistically indescribable? – smci May 17 '18 at 02:39

13 Answers

155

A simple possibility that comes to mind is to keep a compressed array of 2 bits per value for the common cases, and a separate sorted array of 4 bytes per value (24 bits for the original element index, 8 bits for the actual value, so (idx << 8) | value) for the other ones.

When you look up a value, you first do a lookup in the 2bpp array (O(1)); if you find 0, 1 or 2, it's the value you want; if you find 3, it means that you have to look it up in the secondary array. Here you'll perform a binary search to look for the index of your interest left-shifted by 8 (O(log(n)), with a small n, as this should be the 1%), and extract the value from the 4-byte thingie.

std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;

uint8_t lookup(unsigned idx) {
    // extract the 2 bits of our interest from the main array
    uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
    // usual (likely) case: value between 0 and 2
    if(v != 3) return v;
    // bad case: lookup the index<<8 in the secondary array
    // lower_bound finds the first >=, so we don't need to mask out the value
    auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
    // some coherency checks
    if(ptr == sec_arr.end()) std::abort();
    if((*ptr >> 8) != idx) std::abort();
#endif
    // extract our 8-bit value from the 32 bit (index, value) thingie
    return (*ptr) & 0xff;
}

void populate(uint8_t *source, size_t size) {
    main_arr.clear(); sec_arr.clear();
    // size the main storage (round up)
    main_arr.resize((size+3)/4);
    for(size_t idx = 0; idx < size; ++idx) {
        uint8_t in = source[idx];
        uint8_t &target = main_arr[idx>>2];
        // if the input doesn't fit, cap to 3 and put in secondary storage
        if(in >= 3) {
            // top 24 bits: index; low 8 bit: value
            sec_arr.push_back((idx << 8) | in);
            in = 3;
        }
        // store in the target according to the position
        target |= in << ((idx & 3)*2);
    }
}

For an array such as the one you proposed, this should take 10,000,000 / 4 = 2,500,000 bytes for the first array, plus 10,000,000 * 1% * 4 B = 400,000 bytes for the second array; hence 2,900,000 bytes, i.e. less than one third of the original array, and the most-used portion is all kept together in memory, which should be good for caching (it may even fit in L3).

If you need more than 24-bit addressing, you'll have to tweak the "secondary storage"; a trivial way to extend it is to have a 256 element pointer array to switch over the top 8 bits of the index and forward to a 24-bit indexed sorted array as above.
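For illustration, a minimal sketch of that extension (the bucket layout, container choice and helper names here are my own assumptions, not part of the scheme above):

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch: split the secondary storage into 256 buckets selected by
// the top 8 bits of the index; inside a bucket only 24-bit local indices are
// needed, so each entry still fits the (local_idx << 8) | value packing used above.
std::vector<uint32_t> sec_buckets[256];

void sec_insert(size_t idx, uint8_t value) {
    auto &bucket = sec_buckets[(idx >> 24) & 0xff];
    uint32_t key = (uint32_t(idx & 0xffffff) << 8) | value;
    bucket.insert(std::lower_bound(bucket.begin(), bucket.end(), key), key);
}

uint8_t sec_lookup(size_t idx) {
    auto &bucket = sec_buckets[(idx >> 24) & 0xff];
    auto ptr = std::lower_bound(bucket.begin(), bucket.end(),
                                uint32_t(idx & 0xffffff) << 8);
    return uint8_t(*ptr & 0xff); // assumes the element is known to be present
}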


Quick benchmark

#include <algorithm>
#include <vector>
#include <stdint.h>
#include <chrono>
#include <stdio.h>
#include <math.h>

using namespace std::chrono;

/// XorShift32 generator; extremely fast, 2^32-1 period, way better quality
/// than LCG but fails some test suites
struct XorShift32 {
    /// This allows the class to be used wherever a library function
    /// requires a UniformRandomBitGenerator (e.g. std::shuffle)
    typedef uint32_t result_type;
    static uint32_t min() { return 1; }
    static uint32_t max() { return uint32_t(-1); }

    /// PRNG state
    uint32_t y;

    /// Initializes with seed
    XorShift32(uint32_t seed = 0) : y(seed) {
        if(y == 0) y = 2463534242UL;
    }

    /// Returns a value in the range [1, 1<<32)
    uint32_t operator()() {
        y ^= (y<<13);
        y ^= (y>>17);
        y ^= (y<<15);
        return y;
    }

    /// Returns a value in the range [0, limit); this conforms to the RandomFunc
    /// requirements for std::random_shuffle
    uint32_t operator()(uint32_t limit) {
        return (*this)()%limit;
    }
};

struct mean_variance {
    double rmean = 0.;
    double rvariance = 0.;
    int count = 0;

    void operator()(double x) {
        ++count;
        double ormean = rmean;
        rmean     += (x-rmean)/count;
        rvariance += (x-ormean)*(x-rmean);
    }

    double mean()     const { return rmean; }
    double variance() const { return rvariance/(count-1); }
    double stddev()   const { return std::sqrt(variance()); }
};

std::vector<uint8_t> main_arr;
std::vector<uint32_t> sec_arr;

uint8_t lookup(unsigned idx) {
    // extract the 2 bits of our interest from the main array
    uint8_t v = (main_arr[idx>>2]>>(2*(idx&3)))&3;
    // usual (likely) case: value between 0 and 2
    if(v != 3) return v;
    // bad case: lookup the index<<8 in the secondary array
    // lower_bound finds the first >=, so we don't need to mask out the value
    auto ptr = std::lower_bound(sec_arr.begin(), sec_arr.end(), idx<<8);
#ifdef _DEBUG
    // some coherency checks
    if(ptr == sec_arr.end()) std::abort();
    if((*ptr >> 8) != idx) std::abort();
#endif
    // extract our 8-bit value from the 32 bit (index, value) thingie
    return (*ptr) & 0xff;
}

void populate(uint8_t *source, size_t size) {
    main_arr.clear(); sec_arr.clear();
    // size the main storage (round up)
    main_arr.resize((size+3)/4);
    for(size_t idx = 0; idx < size; ++idx) {
        uint8_t in = source[idx];
        uint8_t &target = main_arr[idx>>2];
        // if the input doesn't fit, cap to 3 and put in secondary storage
        if(in >= 3) {
            // top 24 bits: index; low 8 bit: value
            sec_arr.push_back((idx << 8) | in);
            in = 3;
        }
        // store in the target according to the position
        target |= in << ((idx & 3)*2);
    }
}

volatile unsigned out;

int main() {
    XorShift32 xs;
    std::vector<uint8_t> vec;
    int size = 10000000;
    for(int i = 0; i<size; ++i) {
        uint32_t v = xs();
        if(v < 1825361101)      v = 0; // 42.5%
        else if(v < 4080218931) v = 1; // 95.0%
        else if(v < 4252017623) v = 2; // 99.0%
        else {
            while((v & 0xff) < 3) v = xs();
        }
        vec.push_back(v);
    }
    populate(vec.data(), vec.size());
    mean_variance lk_t, arr_t;
    for(int i = 0; i<50; ++i) {
        {
            unsigned o = 0;
            auto beg = high_resolution_clock::now();
            for(int i = 0; i < size; ++i) {
                o += lookup(xs() % size);
            }
            out += o;
            int dur = (high_resolution_clock::now()-beg)/microseconds(1);
            fprintf(stderr, "lookup: %10d µs\n", dur);
            lk_t(dur);
        }
        {
            unsigned o = 0;
            auto beg = high_resolution_clock::now();
            for(int i = 0; i < size; ++i) {
                o += vec[xs() % size];
            }
            out += o;
            int dur = (high_resolution_clock::now()-beg)/microseconds(1);
            fprintf(stderr, "array:  %10d µs\n", dur);
            arr_t(dur);
        }
    }

    fprintf(stderr, " lookup |   ±  |  array  |   ±  | speedup\n");
    printf("%7.0f | %4.0f | %7.0f | %4.0f | %0.2f\n",
            lk_t.mean(), lk_t.stddev(),
            arr_t.mean(), arr_t.stddev(),
            arr_t.mean()/lk_t.mean());
    return 0;
}

(code and data always updated in my Bitbucket)

The code above populates a 10M element array with random data distributed as OP specified in their post, initializes my data structure and then:

  • performs a random lookup of 10M elements with my data structure
  • does the same through the original array.

(notice that in case of sequential lookup the array always wins by a huge measure, as it's the most cache-friendly lookup you can do)

These last two blocks are repeated 50 times and timed; at the end, the mean and standard deviation for each type of lookup are calculated and printed, along with the speedup (lookup_mean/array_mean).

I compiled the code above with g++ 5.4.0 (-O3 -static, plus some warnings) on Ubuntu 16.04, and ran it on several machines; most of them are running Ubuntu 16.04, some an older Linux, some a newer Linux. I don't think the OS should be relevant at all in this case.

            CPU           |  cache   |  lookup (µs)   |     array (µs)  | speedup (x)
Xeon E5-1650 v3 @ 3.50GHz | 15360 KB |  60011 ±  3667 |   29313 ±  2137 | 0.49
Xeon E5-2697 v3 @ 2.60GHz | 35840 KB |  66571 ±  7477 |   33197 ±  3619 | 0.50
Celeron G1610T  @ 2.30GHz |  2048 KB | 172090 ±   629 |  162328 ±   326 | 0.94
Core i3-3220T   @ 2.80GHz |  3072 KB | 111025 ±  5507 |  114415 ±  2528 | 1.03
Core i5-7200U   @ 2.50GHz |  3072 KB |  92447 ±  1494 |   95249 ±  1134 | 1.03
Xeon X3430      @ 2.40GHz |  8192 KB | 111303 ±   936 |  127647 ±  1503 | 1.15
Core i7 920     @ 2.67GHz |  8192 KB | 123161 ± 35113 |  156068 ± 45355 | 1.27
Xeon X5650      @ 2.67GHz | 12288 KB | 106015 ±  5364 |  140335 ±  6739 | 1.32
Core i7 870     @ 2.93GHz |  8192 KB |  77986 ±   429 |  106040 ±  1043 | 1.36
Core i7-6700    @ 3.40GHz |  8192 KB |  47854 ±   573 |   66893 ±  1367 | 1.40
Core i3-4150    @ 3.50GHz |  3072 KB |  76162 ±   983 |  113265 ±   239 | 1.49
Xeon X5650      @ 2.67GHz | 12288 KB | 101384 ±   796 |  152720 ±  2440 | 1.51
Core i7-3770T   @ 2.50GHz |  8192 KB |  69551 ±  1961 |  128929 ±  2631 | 1.85

The results are... mixed!

  1. In general, on most of these machines there is some kind of speedup, or at least they are on a par.
  2. The two cases where the array truly trumps the "smart structure" lookup are machines with lots of cache that are not particularly busy: the Xeon E5-1650 above (15 MB cache) is a night build machine, at the moment quite idle; the Xeon E5-2697 (35 MB cache) is a machine for high-performance calculations, in an idle moment as well. It makes sense: the original array fits completely in their huge cache, so the compact data structure only adds complexity.
  3. At the opposite end of the "performance spectrum" - but where, again, the array is slightly faster - there's the humble Celeron that powers my NAS; it has so little cache that neither the array nor the "smart structure" fits in it at all. Other machines with a small enough cache perform similarly.
  4. The Xeon X5650 results must be taken with some caution - they are virtual machines on a quite busy dual-socket virtualization server; it may well be that, although it nominally has a decent amount of cache, during the test it gets preempted several times by completely unrelated virtual machines.
Matteo Italia
  • If the larger values (`>2`) are nearly equally distributed across the array, you can significantly speed up the search by looking up `i*small_array_size/large_array_size` first. – cmaster - reinstate monica May 14 '18 at 06:32
  • I do sometimes have to change individual values in the array, but only rarely. O(n) for that is quite bad, so I'll have to see if theres enough of a performance benefit to be worth it. I will profile all mentioned solutions of course (and my original one) and see which one is the fastest. @MartinBonner yes, of course `std::sort` , but I'm wondering if I should create a struct to use as the type for `sec_arr` and override the < operator for that, or define the sorting differently? I never really had to use std::sort much yet. – JohnAl May 14 '18 at 09:45
  • 7
    @JohnAl You don't need a struct. A `uint32_t` will be fine. Erasing an element from the secondary buffer will obviously leave it sorted. Inserting an element can be done with `std::lower_bound` and then `insert` (rather than appending and re-sorting the whole thing). Updates make the full-size secondary array much more attractive - I'd certainly start with that. – Martin Bonner supports Monica May 14 '18 at 09:51
  • @MartinBonner only the first 24 bit of the `uint32_t` are allowed to be used for the sorting though, so I can't just sort the `uint32_t` array as if the values would be "real" `uint32_t`, so thats why I asked. I would use a struct with a bitfield and compare the first 24 bit of the bitfield then in the `<` overriden operator, and if theres a "cleaner" way I would like to hear about that. – JohnAl May 14 '18 at 09:55
  • 6
    @JohnAl Because the value is `(idx << 8) + val` you don't have to worry about the value portion - just use a straight compare. It will *always* compare less than `((idx+1) << 8) + val` and less than `((idx-1) << 8) + val` – Martin Bonner supports Monica May 14 '18 at 09:57
  • @MartinBonner oh, I didn't think about that! that makes it easy of course. Thanks! :) – JohnAl May 14 '18 at 10:14
  • ahem! My second "less than" should be "greater than" of course. – Martin Bonner supports Monica May 14 '18 at 10:19
  • 3
    @JohnAl: if that may be useful, I added a `populate` function that should populate `main_arr` and `sec_arr` according to the format that `lookup` expects. I didn't actually try it, so don't expect it to *really* work correctly :-) ; anyhow, it should give you the general idea. – Matteo Italia May 14 '18 at 10:44
  • @MatteoItalia Very useful, thanks! I haven't tested the populate function yet, but I have tested `lookup` and used it to calculate the sum of the array. I compared it with a "regular" array, and the sum is same, so seems to work fine. Also did some profiling (random access, both same): `Array Size: 1M -- Regular Array: 3.9 ms -- lookup function: 6 ms` `Array Size: 10M -- Regular Array: 44 ms -- lookup function: 75 ms` The lookup becomes a lot faster with an array size >500M, but since the 24 bit index is only valid until 16M I think testing with higher values doesn't make sense, right? – JohnAl May 14 '18 at 11:50
  • @MatteoItalia Most interesting: `Array Size: 100K -- Regular Array: 1.58 ms -- lookup function: 0.633 ms` So the `lookup` function seems to work very well with small values actually, but less good with higher values? Kinda the opposite of what I expected. – JohnAl May 14 '18 at 11:53
  • 1
    @JohnAl: that's interesting, I would have expected the opposite, although the very small case may make the difference between L2 and L3 cache; I'll do some tests as well. For the 500 MB array, yes, it cannot be used as is because of the 24 bit index. – Matteo Italia May 14 '18 at 11:54
  • @MatteoItalia That's the code I used for testing: pastebin.com/L5wWF6a6 – JohnAl May 14 '18 at 12:03
  • @JohnAl: with my code and a 10M array I obtain consistently twice the speed with my lookup when performing random access, while plain array access is 10 times faster when performing ordered lookup (as expected). I'll post my code ASAP. – Matteo Italia May 14 '18 at 12:25
  • https://ideone.com/YVSgjv over on ideone it's a bit more nuanced - it's 7 vs 10, but still faster (but IDK what compile options they are using - once on Ideone I even got a 32 bit machine); on smaller arrays regular array lookup wins as expected. – Matteo Italia May 14 '18 at 12:39
  • 1
    @JohnAl in your code you are killing the cache by looping through the `rand_access_order` array! You should generate the indexes for the lookup "online". – Matteo Italia May 14 '18 at 12:45
  • @MatteoItalia Ah, you're right! Generating the index "on the fly" does improve the performance on the `lookup` function significantly in comparison, but still only it being 0-10% faster compared to regular array with this code, by far not twice as fast: https://pastebin.com/ntMGRN57 I'm using an i7 5820k, I will try your code now. – JohnAl May 14 '18 at 12:58
  • @JohnAl: heh, with an i7 5820k you have 15 MB of cache; the whole array fits, so I wouldn't really expect `lookup` to beat straight array indexing. – Matteo Italia May 14 '18 at 13:02
  • 1
    @MatteoItalia With your code, I'm seeing quite big fluctuations in the results: https://pastebin.com/riA9Wbyp Mostly similar to my test code, but sometimes the `lookup` is also significantly faster. Probably because I have significantly more overhead than you in my test. In the end, what matters is what happens when 10 threads at the same time run this (with different arrays), then my 15 MB of cache also won't help much any more, so I'll have to do some more tests with threading :) – JohnAl May 14 '18 at 13:19
  • @MatteoItalia I've added some simple threading to your code to let it run in 8 different threads, each with their completely own data, thats the code: https://pastebin.com/fncR5VZV And I do notice that it all becomes way slower (roughly 2.5 times the amount of time), and the difference between array and lookup becomes even less unfortunately. Feels more like they are exactly same fast now. I am testing them only individually to not make one take more RAM bandwidth than the other one. Shouldn't the difference in RAM bandwidth become more noticeable with more threads? – JohnAl May 14 '18 at 13:53
  • 6
    I'm giving this +1 just for the benchmarking. Nice to see on a question about efficiency and with results for multiple processor types too! Nice! – Jack Aidley May 14 '18 at 14:02
  • Regarding my threading code, I should probably split up the generation of the array and the lookup, and not measure the lookup in threads while other threads currently use up RAM bandwidth and cache with the filling of the array. Maybe that explains the results I see. – JohnAl May 14 '18 at 14:13
  • 2
    @JohnAI You should profile it for your actual use case and nothing else. White room speed doesn't matter. – Jack Aidley May 14 '18 at 15:12
  • That's a nice benchmark, but could you add some details about compiler, compilation options, and OS? Besides, a hash table could do better than a binary search here, at least with a hash function well fitted for the data. – Frax May 14 '18 at 17:58
  • @Frax: I made the benchmark a bit more systematic, added OS & compiler information and tested on other machines; you can find the results in the table above. Yep, a hash table may be a better idea, but I don't have much time to implement it now. Feel free to try it and report the results! – Matteo Italia May 14 '18 at 20:43
  • 1
    @MatteoItalia I have tested now with only measuring time while the threads are busy with the access of the array (after all threads finished populating the array) and I seem to see a constant result of `Array: 180 ms - Lookup: 130 ms` with 6 threads. With 1 thread its `Array: 57 ms - Lookup: 56 ms`. I'm always testing them separately, testing array/lookup simultaniously doesn't make sense since then one can steal cache from the other. So with 6 threads, the lookup seems to be 38% quicker. My code: https://pastebin.com/XhBrk7iy Have you tested anything with threads? – JohnAl May 15 '18 at 03:38
  • @JohnAI not yet, will do it tonight. But in your real use case is it going to be each thread with its own data or all threads looking up in a shared data structure? – Matteo Italia May 15 '18 at 05:42
  • @MatteoItalia all threads work on their own data, so data is not shared between threads. – JohnAl May 15 '18 at 08:06
  • Maybe I'm missing something, but this answer seems to address mostly space usage, while the OP seems to be concerned mostly about bandwidth (i.e. time). – AnoE May 15 '18 at 12:49
  • 1
    @AnoE: AFAICT the only time optimization you can do over a random access array of usually-small elements (which in a "classic", all equally fast-memory model would be the fastest way to access them) is to squeeze it so that it will fit in cache, thus gaining speed by climbing the memory hierarchy. – Matteo Italia May 15 '18 at 13:20
  • @MatteoItalia, would you mind adding a sentence to your answer to point out that it only makes sense to compress it this way (i.e., by a constant factor which is not dependent on the actual distribution of the data, like the approach you used) if the end results fits in the cache(s)? Only OP knows how large his data (and his cache) actually is; and it would help him deciding. That said, there is more space efficient compression (again, depending on the data => hash tables etc.), which might be in order if the above is not the case. – AnoE May 15 '18 at 13:36
  • 1
    @AnoE: I can add it, but it's pretty much implicit in the question - OP already clearly states (1) how big the data is (2) its distribution (3) that his bottleneck is RAM bandwith. Also, the approach is *very* dependent from the distribution of data as detailed by OP: it uses a primary fast (O(1) random access) and compact (2 bit per data, quite good for the given distribution) for the most frequent data, and a slower (O(log n)) and bigger (4 byte per element) storage for the rest. A hash table has to store the index and value anyway, so at best it can be used as a better secondary storage. – Matteo Italia May 15 '18 at 13:56
  • Using a small array to handle the 0/1 cases is going to be helpful, but I would think it best to handle the >1 case using a straight array. Fetching a byte from the straight array will typically result in one cache miss, and will at worst result in two (if it displaces something that would otherwise have been useful). I don't think the binary search is going to result in O(lgN) cache misses, but wouldn't expect it to work brilliantly. – supercat May 15 '18 at 22:01
  • @MatteoItalia: Ideas to test: does `std::vector` offer any improvement for the small array? (Probably not) Also, right now you're storing 16 "small values" per 32 bits. You can actually cram 20 per 32 bits by using base 3. (0,1,lookup). This appears to scale down to 5 values per byte safely. This may help the tiny-cache machines, at the expense of the large cache machines. – Mooing Duck May 17 '18 at 05:58
  • @MooingDuck: `vector` is probably going to give only overhead - especially because I'd have to combine two values to obtain a single 0-3; trinary encoding is something I was thinking about - I even have some code lying around that performs the encoding/decoding for 16 bit words without using divisions (it was used once in a product where we needed to compactly transmit -1/0/+1 values), but I couldn't find the time to benchmark it - and I still have to write/run the multithread benchmark *and* one for the simpler solution (with the original array as secondary storage)! – Matteo Italia May 17 '18 at 06:27
  • 1
    TBH I've been taken a bit by surprise by the reception of this answer - it really started just as an idle idea wrote down while on the train, and now I'm struggling to find time to write benchmarks to be run on every pc/server I have access to . #StackOverflowProblems – Matteo Italia May 17 '18 at 06:31
33

Another option could be

  • check if the result is 0, 1 or 2
  • if not, do a regular lookup

In other words something like:

unsigned char lookup(int index) {
    int code = (bmap[index>>2]>>(2*(index&3)))&3;
    if (code != 3) return code;
    return full_array[index];
}

where bmap uses 2 bits per element with the value 3 meaning "other".

This structure is trivial to update and uses 25% more memory, but the big part is looked up in only 5% of the cases. Of course, as usual, whether it's a good idea or not depends on a lot of other conditions, so the only answer is experimenting with real usage.
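For illustration, an update could be a minimal sketch along these lines (the std::vector storage and the function name are my assumptions, not something given in this answer):

#include <cstdint>
#include <vector>

std::vector<uint8_t> bmap;       // 2 bits per element, value 3 means "other"
std::vector<uint8_t> full_array; // the plain uint8_t backing array

void store(int index, uint8_t value) {
    full_array[index] = value;            // keep the full array current
    uint8_t code = value < 3 ? value : 3; // 3 redirects lookups to full_array
    uint8_t &cell = bmap[index >> 2];
    int shift = 2 * (index & 3);
    cell = (cell & ~(3 << shift)) | (code << shift);
}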

6502
  • 4
    I'd say that's a good compromise to get as many cache hits as possible (since the reduced structure can fit in the cache more easily), without losing much on random access time. – meneldal May 14 '18 at 07:18
  • I think this can be further improved. I have had success in the past with a similar but different problem where exploiting branch predicition helped a lot. It may help to split the `if(code != 3) return code;` into `if(code == 0) return 0; if(code==1) return 1; if(code == 2) return 2;` – kutschkem May 16 '18 at 12:51
  • @kutschkem: in that case, `__builtin_expect` & co or PGO can also help. – Matteo Italia May 19 '18 at 21:00
23

This is more of a "long comment" than a concrete answer.

Unless your data is something well-known, I doubt anyone can DIRECTLY answer your question (and I'm not aware of anything that matches your description, but then I don't know EVERYTHING about all kinds of data patterns for all kinds of use cases). Sparse data is a common problem in high-performance computing, but it's typically "we have a very large array, but only some values are non-zero".

For a pattern that is not well known, like the one I think yours is, nobody can KNOW directly which is better; it depends on the details: how random is the random access? Is the system accessing clusters of data items, or is it completely random, as from a uniform random number generator? Is the table data completely random, or are there sequences of 0 and then sequences of 1, with a scattering of other values? Run-length encoding would work well if you have reasonably long sequences of 0 and 1, but it won't work if you have a "checkerboard of 0/1". Also, you'd have to keep a table of "starting points" so you can work your way to the relevant place reasonably quickly.
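As a hedged illustration of that last point (this is my sketch, not code from the answer): random reads into a run-length-encoded table stay cheap if the starting index of every run is kept in a sorted array that can be binary-searched.

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical layout: run_start[i] is the first index of run i (run_start[0] == 0,
// strictly ascending), run_value[i] is the value stored throughout that run.
std::vector<uint32_t> run_start;
std::vector<uint8_t>  run_value;

uint8_t rle_lookup(uint32_t idx) {
    // find the last run whose starting index is <= idx
    auto it = std::upper_bound(run_start.begin(), run_start.end(), idx);
    return run_value[(it - run_start.begin()) - 1];
}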

I know from a long time back that some big databases are just a large table in RAM (telephone exchange subscriber data in this example), and one of the problems there is that caches and page-table optimisations in the processor are pretty useless. The caller is so rarely the same as one who recently called someone that there is no pre-loaded data of any kind; it's just purely random. Big page tables are the best optimisation for that type of access.

In a lot of cases, the compromise between "speed and small size" is one of those trade-offs you have to make in software engineering [in other engineering disciplines it's not necessarily so much of a compromise]. So, "wasting memory for simpler code" is quite often the preferred choice. In this sense, the "simple" solution is quite likely better for speed, but if you have a "better" use for the RAM, then optimising for the size of the table would give you sufficient performance and a good improvement in size. There are lots of different ways you could achieve this - as suggested in a comment, a 2-bit field where the two or three most common values are stored, and then some alternative data format for the other values - a hash table would be my first approach, but a list or binary tree may work too - again, it depends on the patterns of where your "not 0, 1 or 2" values are, and on how they are "scattered" in the table: are they in clusters, or are they more of an evenly distributed pattern?

But a problem with that is that you are still reading the data from RAM. You are then spending more code processing the data, including some code to cope with the "this is not a common value".

The problem with most common compression algorithms is that they are based on unpacking sequences, so you can't random access them. And the overhead of splitting your big data into chunks of, say, 256 entries at a time, and uncompressing the 256 into a uint8_t array, fetching the data you want, and then throwing away your uncompressed data, is highly unlikely to give you good performance - assuming that's of some importance, of course.

In the end, you will probably have to implement one or a few of the ideas in comments/answers to test out, see if it helps solving your problem, or if memory bus is still the main limiting factor.

Mats Petersson
  • Thanks! In the end, I'm just interested in whats quicker when 100% of the CPU is busy with looping over such arrays (different threads over different arrays). Currently, with a `uint8_t` array, the RAM bandwidth is saturated after ~5 threads are working on that at the same time (on a quad channel system), so using more than 5 threads no longer gives any benefit. I would like this to use >10 threads without running into RAM bandwidth issues, but if the CPU side of the access becomes so slow that 10 threads get less done than 5 threads before, that would obviously not be progress. – JohnAl May 14 '18 at 07:00
  • @JohnAl How many cores do you have? If you are CPU bound, there's no point having more threads than cores. Also, maybe time to look at GPU programming? – Martin Bonner supports Monica May 14 '18 at 09:54
  • @MartinBonner I do currently have 12 threads. And I agree, this would probably run very nicely on a GPU. – JohnAl May 14 '18 at 10:07
  • 2
    @JohnAI: If you are simply running multiple versions of the same inefficient process on multiple threads, you will always see limited progress. There will be bigger wins in designing your algorithm for parallel processing than in tweaking a storage structure. – Jack Aidley May 14 '18 at 10:48
13

What I've done in the past is use a hashmap in front of a bitset.

This halves the space compared to Matteo's answer, but may be slower if "exception" lookups are slow (i.e. there are many exceptions).

Often, however, "cache is king".
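For illustration, a minimal sketch of one way to read this suggestion (the names and containers are my assumptions; the answer itself gives no code): a 1-bit-per-element bitset for the 0/1 values, plus a hash map holding every index whose value is 2 or greater.

#include <cstdint>
#include <unordered_map>
#include <vector>

std::vector<bool> bits;                      // 1 bit per element: the 0/1 values
std::unordered_map<uint32_t, uint8_t> other; // exceptions: index -> value (>= 2)

uint8_t lookup(uint32_t idx) {
    auto it = other.find(idx);    // exceptional cases are checked first...
    if (it != other.end()) return it->second;
    return bits[idx] ? 1 : 0;     // ...otherwise the bit itself is the value
}

This is how the bitset ends up half the size of the 2-bit variant; the price is a hash lookup on every access.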

o11c
  • 2
    How exactly would a hashmap _halve the space compared to Matteo's answer_? What should be in that hashmap? – JohnAl May 14 '18 at 07:04
  • 1
    @JohnAl Using a 1-bit bitset=bitvec instead of a 2-bit bitvec. – o11c May 14 '18 at 08:50
  • 2
    @o11c I'm not sure if I understand it correctly. You mean to have an array of 1 bit values where `0` means _look at `main_arr`_ and `1` means _look at the `sec_arr`_ (in the case of Matteos code)? That would need overall more space than Matteos answer though, since its one additional array. I don't quite understand how you would do it only using half the space compared to Matteos answer. – JohnAl May 14 '18 at 08:56
  • 1
    Could you clarify this? You look up the exceptional cases *first*, and *then* look in the bitmap? If so, I suspect the slow lookup in the hash will overwhelm the savings in reducing the size of the bitmap. – Martin Bonner supports Monica May 14 '18 at 10:23
  • I thought this was called hashlinking - but google turns up no relevant hits so it must be something else. The way it usually worked was to have say a byte array that would hold values the vast majority of which were, say, between 0..254. Then you'd use 255 as a flag, and if you had a 255 element you'd look up the true value in an associated hash table. Can someone remember what it was called? (I think I read about it in an old IBM TR.) Anyway, you could also arrange it the way @o11c suggests - always lookup in the hash first, if it is not there, look in your bit array. – davidbak May 14 '18 at 22:21
  • ... and to (possibly) answer @MartinBonner's suggestion about the time performance - the time would be dominated (I imagine) not by the hash lookup but by the hash computation on the index - you could substitute something else as the "exceptional" data holder - e.g., an array mapped trie and replace the hash computation with something cheaper (bit shifting&masking, for example). – davidbak May 14 '18 at 22:28
  • @davidbak I would expect std::hash of an integer to be a multiplication (possibly with an addition). That's not going to dominate even an L1 cache miss. In this case (where the index is presumably uniformly distributed), I would use the integer as its hash. – Martin Bonner supports Monica May 15 '18 at 05:37
11

Unless there is a pattern to your data, it is unlikely that there is any sensible speed or size optimisation, and - assuming you are targeting a normal computer - 10 MB isn't that big a deal anyway.

There are two assumptions in your question:

  1. The data is being poorly stored because you aren't using all the bits
  2. Storing it better would make things faster.

I think both of these assumptions are false. In most cases the appropriate way to store data is to store the most natural representation. In your case, this is the one you've gone for: a byte for a number between 0 and 255. Any other representation will be more complex and therefore - all other things being equal - slower and more error-prone. To justify diverting from this general principle, you need a stronger reason than potentially six "wasted" bits on 95% of your data.

For your second assumption, it will be true if, and only if, changing the size of the array results in substantially fewer cache misses. Whether this will happen can only be definitively determined by profiling working code, but I think it's highly unlikely to make a substantial difference. Because you will be randomly accessing the array either way, the processor will struggle to know which bits of data to cache and keep.

Jack Aidley
8

If the data and accesses are uniformly randomly distributed, performance is probably going to depend upon what fraction of accesses avoid an outer-level cache miss. Optimizing that will require knowing what size array can be reliably accommodated in cache. If your cache is large enough to accommodate one byte for every five cells, the simplest approach may be to have one byte hold five base-three-encoded values in the range 0-2 (there are 243 combinations of 5 such values, so that fits in a byte), along with a 10,000,000-byte array that would be queried whenever the base-3 value indicates "2".
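As a rough sketch of that packing (my own illustration; here the digit 2 doubles as the "consult the big array" marker, and the names are made up):

#include <cstdint>
#include <vector>

// 5 base-3 digits per byte: 3^5 = 243 <= 256.
std::vector<uint8_t> packed; // ceil(N / 5) bytes
std::vector<uint8_t> big;    // full N-byte array, consulted when the digit is 2

uint8_t lookup(uint32_t idx) {
    static const uint16_t pow3[5] = {1, 3, 9, 27, 81};
    uint8_t digit = (packed[idx / 5] / pow3[idx % 5]) % 3;
    return digit < 2 ? digit : big[idx]; // 2 means "look it up for real"
}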

If the cache isn't that big, but could accommodate one byte per 8 cells, then it would not be possible to use a one-byte value to select from among all 6,561 possible combinations of eight base-3 values; but since the only effect of changing a 0 or 1 to a 2 would be to cause an otherwise-unnecessary lookup, correctness wouldn't require supporting all 6,561. Instead, one could focus on the 256 most "useful" values.

Especially if 0 is more common than 1, or vice versa, a good approach might be to use 219 values to encode the combinations of 0 and 1 that contain 5 or fewer 1's, 16 values to encode xxxx0000 through xxxx1111, 16 to encode 0000xxxx through 1111xxxx, and one for xxxxxxxx. Four values would remain for whatever other use one might find. If the data are randomly distributed as described, a slight majority of all queries would hit bytes which contained just zeroes and ones (in about 2/3 of all groups of eight, all bits would be zeroes and ones, and about 7/8 of those would have five or fewer 1 bits); the vast majority of those that didn't would land in a byte which contained four x's, and would have a 50% chance of landing on a zero or a one. Thus, only about one in four queries would necessitate a large-array lookup.

If the data are randomly distributed but the cache isn't big enough to handle one byte per eight elements, one could try to use this approach with each byte handling more than eight items, but unless there is a strong bias toward 0 or toward 1, the fraction of values that can be handled without having to do a lookup in the big array will shrink as the number handled by each byte increases.

supercat
7

I'll add to @o11c's answer, as his wording might be a bit confusing. If I need to squeeze the last bit and CPU cycle I'd do the following.

We will start by constructing a balanced binary search tree that holds the 5% "something else" cases. For every lookup, you walk the tree quickly: you have 10,000,000 elements, 5% of which are in the tree, so the tree data structure holds 500,000 elements. Walking this in O(log(n)) time gives you 19 iterations. I'm no expert at this, but I guess there are some memory-efficient implementations out there. Let's guesstimate:

  • Balanced tree, so subtree position can be calculated (indices do not need to be stored in the nodes of the tree). The same way a heap (data structure) is stored in linear memory.
  • 1 byte value (2 to 255)
  • 3 bytes for the index (10,000,000 takes 24 bits, which fits in 3 bytes)

Totalling 4 bytes per node: 500,000 * 4 bytes = 1953 kB. It fits the cache!

For all the other cases (0 or 1), you can use a bitvector. Note that for random access you cannot leave out the 5% "other" cases from the bitvector; it takes 1.19 MB.

The combination of these two uses approximately 3.099 MB. Using this technique, you save a factor of 3.08 in memory.

However, this doesn't beat the answer of @Matteo Italia (which uses 2.76 MB), which is a pity. Is there anything extra we can do? The most memory-consuming part is the 3 bytes of index in the tree. If we can get this down to 2, we would save 488 kB and the total memory usage would be 2.622 MB, which is smaller!

How do we do this? We have to reduce the indexing to 2 bytes. Again, 10,000,000 takes 24 bits, so we need to be able to drop 8 bits. We can simply do this by partitioning the range of 10,000,000 elements into 2^8 (= 256) regions of about 39,000 elements. Now we can build a balanced tree for each of these regions, with roughly 1,950 elements on average. Picking the right tree is done by a simple integer division of the target index by the region size, and the index within the region can then be represented by the remaining 16 bits. Note that there is some overhead for the length of each tree that needs to be stored, but this is negligible. Also note that this splitting mechanism reduces the required number of iterations to walk a tree: because we dropped 8 bits, only about 11 iterations are left instead of 19.

Note that you could theoretically repeat the process to cut off the next 8 bits, but this would require you to create 2^16 balanced trees, with ~150 elements per region on average. This would result in 2.143 MB, with only about 3 iterations to walk the tree, which is a considerable speedup compared to the 19 iterations we started with.
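A hedged sketch of the first-level split (using one sorted array per region instead of an explicit tree, which needs the same number of binary-search steps; the region size and names are my assumptions):

#include <algorithm>
#include <cstdint>
#include <vector>

constexpr uint32_t REGION = 39063; // ~10,000,000 / 256, so local indices fit in 16 bits

struct Exception { uint16_t local; uint8_t value; };

std::vector<bool> bits;              // the 0/1 bitvector (exception positions hold a dummy bit)
std::vector<Exception> regions[256]; // per-region exceptions, sorted by local index

uint8_t lookup(uint32_t idx) {
    const auto &r = regions[idx / REGION];
    uint16_t local = uint16_t(idx % REGION);
    auto it = std::lower_bound(r.begin(), r.end(), local,
        [](const Exception &e, uint16_t key) { return e.local < key; });
    if (it != r.end() && it->local == local) return it->value;
    return bits[idx] ? 1 : 0;
}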

As a final conclusion: this beats the 2-bit vector strategy by a tiny bit of memory usage, but it is a whole struggle to implement. But if it can make the difference between fitting in the cache or not, it might be worth a try.

Martijn Courteaux
  • 1
    Valiant effort! – davidbak May 15 '18 at 19:50
  • 1
    Try this: Since 4% of the cases are the value 2 ... create a set of exceptional cases (>1). Create a tree somewhat as described for really exceptional cases (>2). If present in set and tree then use value in tree; if present in set and _not_ tree then use value 2, otherwise (not present in set) lookup in your bitvector. Tree will contain only 100000 elements (bytes). Set contains 500000 elements (but no values at all). Does this reduce size while justifying its increased cost? (100% of lookups look in set; 5% of lookups need to look in tree also.) – davidbak May 15 '18 at 19:54
  • You always want to use a CFBS-sorted array when you have an immutable tree, so there is no allocation for the nodes, just the data. – o11c Jun 01 '18 at 01:29
5

If you only perform read operations, it would be better not to assign a value to a single index but to an interval of indices.

For example:

[0, 15000] = 0
[15001, 15002] = 153
[15003, 26876] = 2
[26877, 31578] = 0
...

This can be done with a struct. You also might want to define a class similar to this if you like an OO approach.

class Interval {
  private:
    uint32_t start; // First element of interval
    uint32_t end;   // Last element of interval
    uint8_t value;  // Assigned value

  public:
    Interval(uint32_t start, uint32_t end, uint8_t value);
    bool isInInterval(uint32_t item) const; // Checks if item lies within the interval
    uint8_t getValue() const;               // Returns the assigned value
};

Now you just have to iterate through a list of intervals and check if your index lies within one of them, which can be much less memory-intensive on average but costs more CPU resources.

Interval intervals[INTERVAL_COUNT];
intervals[0] = Interval(0, 15000, 0);
intervals[1] = Interval(15001, 15002, 153);
intervals[2] = Interval(15003, 26876, 2);
intervals[3] = Interval(25677, 31578, 0);
...

uint8_t checkIntervals(uint32_t item)
{
    for(int i=0; i<INTERVAL_COUNT; i++)
    {
        if(intervals[i].isInInterval(item))
        {
            return intervals[i].getValue();
        }
    }
    return DEFAULT_VALUE;
}

If you order the intervals by descending size, you increase the probability that the item you are looking for is found early, which further decreases your average memory and CPU resource usage.

You could also remove all intervals with a size of 1. Put the corresponding values into a map and check them only if the item you are looking for wasn't found in the intervals. This should also raise the average performance a bit.

Detonar
  • 4
    Interesting idea (+1) but I am somewhat skeptical that it would justify the overhead unless there are a lot of long runs of 0's and/or long runs of 1's. In effect you are suggesting using a run-length encoding of the data. It might be good in some situations but probably isn't a good general approach to this problem. – John Coleman May 15 '18 at 12:06
  • Right. In particular for random access, this is almost certainly _slower_ than a simple array of `uint8_t`, even if it takes much less memory. – leftaroundabout May 16 '18 at 14:02
4

A long, long time ago, as far as I can remember...

At university we got the task of accelerating a ray-tracer program that had to read, over and over again, from buffer arrays. A friend told me to always use RAM reads that are multiples of 4 bytes. So I changed the array from a pattern of [x1,y1,z1,x2,y2,z2,...,xn,yn,zn] to a pattern of [x1,y1,z1,0,x2,y2,z2,0,...,xn,yn,zn,0], meaning I added an empty field after each 3D coordinate. After some performance testing, it was faster. Long story short: read a multiple of 4 bytes from your array in RAM, and maybe also from the right starting position, so that you read a little cluster in which the searched index lies and take the searched index from that little cluster in the CPU. (In your case you will not need to insert fill fields, but the concept should be clear.)
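For what it's worth, a purely hypothetical illustration of that padding trick (not the original code, and assuming 4-byte floats):

// Unpadded: 12-byte elements, so most elements straddle a 16-byte boundary.
struct Vec3       { float x, y, z; };

// Padded: one unused float keeps every element in its own 16-byte slot.
struct Vec3Padded { float x, y, z, pad; };

static_assert(sizeof(Vec3Padded) == 16, "one coordinate per 16-byte slot");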

Maybe other multiples could also be the key in newer systems.

I don't know if this will work in your case, so if it doesn't: sorry. If it works, I would be happy to hear about some test results.

PS: Oh and if there is any access pattern or nearby accessed indices, you can reuse the cached cluster.

PPS: It could be that the multiple was more like 16 bytes or something like that; it's too long ago for me to remember exactly.

Horitsu
  • You are probably thinking about cachelines, which are usually 32 or 64 bytes, but that won't help much here as the access is random. – Surt May 20 '18 at 21:24
3

Looking at this, you could split your data, for example:

  • a bitset which gets indexed and represents the value 0 (std::vector<bool> would be useful here)
  • a bitset which gets indexed and represents the value 1
  • a std::vector for the values of 2, containing the indexes which refer to this value
  • a map for the other values (or a sorted std::vector of index/value pairs)

In this case, all values appear up to a given index, so you could even remove one of the bitsets and represent that value as the one missing from the other ones.

This will save you some memory for this case, though it would make the worst case worse. You'll also need more CPU power to do the lookups.
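A minimal sketch of how such a split could be laid out (my own naming and container choices; the list above only describes it in words):

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

struct SplitStorage {
    std::vector<bool> is_zero;          // bit i set -> value at i is 0
    std::vector<bool> is_one;           // bit i set -> value at i is 1
    std::vector<uint32_t> twos;         // sorted indices whose value is 2
    std::map<uint32_t, uint8_t> others; // everything else: index -> value

    uint8_t get(uint32_t idx) const {
        if (is_zero[idx]) return 0;
        if (is_one[idx]) return 1;
        if (std::binary_search(twos.begin(), twos.end(), idx)) return 2;
        return others.at(idx); // assumes every index falls into one of the four groups
    }
};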

Make sure to measure!

JVApen
2

Like Mats mentions in his comment-answer, it is hard to say what is actually the best solution without knowing specifically what kind of data you have (e.g., are there long runs of 0's, and so on), and what your access pattern looks like (does "random" mean "all over the place" or just "not strictly in completely linear fashion" or "every value exactly once, just randomized" or ...).

That said, there are two mechanisms coming to mind:

  • Bit arrays; i.e., if you only had two values, you could trivially compress your array by a factor of 8; if you have 4 values (or "3 values + everything else") you can compress by a factor of four. This might just not be worth the trouble and would need benchmarks, especially if you have really random access patterns which escape your caches and hence the compression does not change the access time at all.
  • (index,value) or (value,index) tables. I.e., have one very small table for the 1% case, maybe one table for the 5% case (which only needs to store the indexes as all have the same value), and a big compressed bit array for the final two cases. And with "table" I mean something which allows relatively quick lookup; i.e., maybe a hash, a binary tree, and so on, depending on what you have available and your actual needs. If these subtables fit into your 1st/2nd level caches, you might get lucky.
AnoE
1

I am not very familiar with C, but in C++ you can use unsigned char to represent an integer in the range 0 - 255.

Compared to a normal int (again, I am coming from the Java and C++ world), which requires 4 bytes (32 bits), an unsigned char requires 1 byte (8 bits). So it might reduce the total size of the array by 75%.

Adi
-4

You have succinctly described all the distribution characteristics of your array; toss the array.

You can easily replace the array with a randomized method that produces the same probabilistic output as the array.
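For illustration, a hedged sketch of that idea (my own code; the 47.5/47.5 split of the 0/1 mass is an assumption, since the question doesn't specify it):

#include <cstdint>
#include <random>

// Draw values with the stated distribution: ~95% {0,1}, 4% {2}, 1% {3..255}.
uint8_t random_value(std::mt19937 &rng) {
    static std::discrete_distribution<int> cls({47.5, 47.5, 4.0, 1.0});
    switch (cls(rng)) {
        case 0:  return 0;
        case 1:  return 1;
        case 2:  return 2;
        default: {
            static std::uniform_int_distribution<int> rest(3, 255);
            return uint8_t(rest(rng));
        }
    }
}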

If consistency matters (producing the same value for the same random index), consider using a bloom filter and/or hash map to track repeat hits. If your array accesses really are random, though, this is totally unnecessary.

Dúthomhas
  • 18
    I suspect "random access" was being used here to indicate that accesses are unpredictable, not that they are actually random. (i.e. it's intended in the sense of "random access files") – Michael Kay May 14 '18 at 17:16
  • Yes, that is likely. OP isn't clear, however. If OP's accesses are in any way not random, then some form of sparse array is indicated, as per the other answers. – Dúthomhas May 14 '18 at 18:08
  • 1
    I think you have a point there, since the OP indicated he would loop over the entire array in a random order. For the case that only distributions need to be observed, this is a good answer. – Ingo Schalk-Schupp May 15 '18 at 21:11