
Is there any reliable way to force GCC (or any compiler) to factor the runtime size checks in memcpy() out of a loop (where the size is not a compile-time constant, but is constant within that loop), specializing the loop for each relevant size range rather than repeatedly checking the size inside it?

This is a test case reduced from a performance regression reported here for an open-source library designed for efficient in-memory analysis of large data sets. (The regression happens to be because of one of my commits...)

The original code is in Cython, but I've reduced it to the following pure C proxy:

void take(double * out, double * in,
          int stride_out_0, int stride_out_1,
          int stride_in_0, int stride_in_1,
          int * indexer, int n, int k)
{
    int i, idx, j, k_local;
    k_local = k; /* prevent aliasing */
    for(i = 0; i < n; ++i) {
        idx = indexer[i];
        for(j = 0; j < k_local; ++j)
            out[i * stride_out_0 + j * stride_out_1] =
            in[idx * stride_in_0 + j * stride_in_1];
    }
}

The strides are variable; in general the arrays are not even guaranteed to be contiguous (since they might be non-contiguous slices of larger arrays). However, for the particular case of C-contiguous arrays, I've optimized the above to the following:

void take(double * out, double * in,
          int stride_out_0, int stride_out_1,
          int stride_in_0, int stride_in_1,
          int * indexer, int n, int k)
{
    int i, idx, k_local;
    assert(stride_out_0 == k);
    assert(stride_out_0 == stride_in_0);
    assert(stride_out_1 == 1);
    assert(stride_out_1 == stride_in_1);
    k_local = k; /* prevent aliasing */
    for(i = 0; i < n; ++i) {
        idx = indexer[i];
        memcpy(&out[i * k_local], &in[idx * k_local],
               k_local * sizeof(double));
    }
}

(The asserts are not present in the original code; instead it checks for contiguity and calls the optimized version if possible, and the unoptimized one if not.)

This version optimizes very well in most cases, since the normal use case is for small n and large k. However, the opposite use case does happen as well (large n and small k), and it turns out that for the particular case of n == 10000 and k == 4 (which cannot be ruled out as representative of an important part of a hypothetical workflow), the memcpy() version is 3.6x slower than the original. This is apparently mainly because k is not a compile-time constant, as evidenced by the fact that the next version performs (almost or exactly, depending on optimization settings) as well as the original, or sometimes better, for the particular case of k == 4:

    if (k_local == 4) {
        /* this optimizes */
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            memcpy(&out[i * k_local], &in[idx * k_local],
                   k_local * sizeof(double));
        }
    } else {
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            memcpy(&out[i * k_local], &in[idx * k_local],
                   k_local * sizeof(double));
        }
    }

Obviously, it's not practical to hardcode a loop for every particular value of k, so I attempted the following instead (as a first attempt that could later be generalized, if it worked):

    if (k_local >= 0 && k_local <= 4) {
        /* this does not optimize */
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            memcpy(&out[i * k_local], &in[idx * k_local],
                   k_local * sizeof(double));
        }
    } else {
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            memcpy(&out[i * k_local], &in[idx * k_local],
                   k_local * sizeof(double));
        }
    }

Unfortunately, this last version is no faster than the original memcpy() version, which is somewhat disheartening for my faith in GCC's optimization abilities.

Is there any way I can give extra "hints" to GCC (through any means) that will help it do the right thing here? (And even better, are there "hints" that could reliably work across different compilers? This library is compiled for many different targets.)

The results quoted are for GCC 4.6.3 on 32-bit Ubuntu with the "-O2" flag, but I've also tested GCC 4.7.2 and "-O3" versions with similar (but not identical) results. I've posted my test harness to LiveWorkspace, but the timings are from my own machine using the time(1) command (I don't know how reliable LiveWorkspace timings are.)

EDIT: I've also considered just setting a "magic number" for some minimum size to call memcpy() with, and I could find such a value with repeated testing, but I'm not sure how generalizable my results would be across different compilers/platforms. Is there any rule of thumb I could use here?

FURTHER EDIT: Realized the k_local variables are kind of useless in this case, actually, since no aliasing is possible; this was reduced from some experiments I ran where it was possible (k was global) and I forgot I changed it. Just ignore that part.

EDIT TAG: Realized I can also use C++ in newer versions of Cython, so tagging as C++ in case there's anything that can help from C++...

FINAL EDIT: In lieu (for now) of dropping down to assembly for a specialized memcpy(), the following seems to be the best empirical solution for my local machine:

    int i, idx, j;
    double * subout, * subin;
    assert(stride_out_1 == 1);
    assert(stride_out_1 == stride_in_1);
    if (k < 32 /* i.e. 256 bytes: magic! */) {
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            subout = &out[i * stride_out_0];
            subin = &in[idx * stride_in_0];
            for(j = 0; j < k; ++j)
                subout[j] = subin[j];
        }
    } else {
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            subout = &out[i * stride_out_0];
            subin = &in[idx * stride_in_0];
            memcpy(subout, subin, k * sizeof(double));
        }
    }

This uses a "magic number" to decide whether to call memcpy() or not, but still optimizes the case for small arrays that are known to be contiguous (so it's faster than the original, in most cases, since the original makes no such assumption).

Stephen Lin
  • I think the memory layout you describe [here](http://mail.python.org/pipermail/pandas-dev/2013-March/000008.html) is a pathological case which is bound to produce lots of cache & TLB misses. Can you measure those? – Michael Foukarakis Mar 21 '13 at 17:58
  • @MichaelFoukarakis sure, any suggestions on what I should try? not really sure what I should be varying and experimenting with in my attempts; haven't really had much experience with cache issues. – Stephen Lin Mar 21 '13 at 17:59
  • Typically you want to vary (x, y) dimensions for each memory layout (row-major (C), column-major, others like Z-ordering, etc.) and see access patterns in terms of misses. – Michael Foukarakis Mar 21 '13 at 18:06
  • @MichaelFoukarakis ok, I'll try, but regardless, is my analysis about `memcpy()` reasonable? in theory, the compiler ought to be able to choose a "small array" version in the loop, right, but it's not... – Stephen Lin Mar 21 '13 at 18:07
  • @MichaelFoukarakis how do I diagnose cache misses btw? any link you can provide to a tutorial? – Stephen Lin Mar 21 '13 at 18:08
  • I use [PAPI](http://icl.cs.utk.edu/papi/) for this kind of measurement. As for optimizing memcpy, I think you should look into the source code of your libc. – Michael Foukarakis Mar 21 '13 at 18:21
  • @MichaelFoukarakis sure, I will, but this is supposed to be compiled on many different platforms...do I not have any hope of getting what I want (basically loop unswitching) to work reliably? if not, is there any rule of thumb I can use for a "magic threshold", or do I basically just have to keep experimenting across different process/compiler/platform combos? (and/or just stop caring about pathological cases...) – Stephen Lin Mar 21 '13 at 18:23
  • @MichaelFoukarakis and btw, isn't `memcpy()` a compiler intrinsic anyway, for GCC? – Stephen Lin Mar 21 '13 at 18:26
  • you might be able to get the right coercion using `__builtin_expect`, or try the opposite approach and create an always_inline'd pseudo-memcpy clone that's just a linear copy loop + switch for trailing bytes; it may for some reason get optimized better by the value range propagation pass. – Necrolis Mar 21 '13 at 22:10
  • @Necrolis I'll try `__builtin_expect`...and making a specialized version is fine, but should I just guess and check a cutoff then? I'm afraid of over-fitting to my local machine specs (this is a cross-platform library released as Python and Cython source), but if there's no other option I guess I don't have a choice... – Stephen Lin Mar 21 '13 at 22:15
  • @Necrolis ideally the compiler should know its own cutoff points and optimize itself, rather than me having to guess universal values, but I suppose it could be a bit too much to ask... – Stephen Lin Mar 21 '13 at 22:17
  • You could try feedback directed optimization, maybe gcc will do something clever with that info? Otherwise, I think the magic cutoff is the way to go. – Mackie Messer Mar 21 '13 at 22:29
  • @MackieMesser le sigh. maybe I should just patch GCC. – Stephen Lin Mar 21 '13 at 22:30
  • What effect does using 'const' and/or 'register' have on the k_local, idx and i variables? k_local is a candidate for both, and i and idx are candidates for register variables. The explicit declaration may give the optimizer the clue that it can make more assumptions about k_local in particular -- which is the trigger for the optimization path. Honestly... only thinking out loud here. – K Scott Piel Mar 21 '13 at 19:26
  • this kind of thing is probably better as a comment, btw, but thanks...will try – Stephen Lin Mar 21 '13 at 19:28
  • It wouldn't allow me to comment ~smile~ Only route I had to offer feedback. I'm still too newb on the site to use the comment feature. Hope it helps. – K Scott Piel Mar 21 '13 at 19:30
  • no dice on `const register k_local = k` :( – Stephen Lin Mar 21 '13 at 19:30
  • oh, right, forgot about that...I'll +1 you to help you along...everyone else, please do not downvote! – Stephen Lin Mar 21 '13 at 19:30
  • I just +1'ed your other answers too, welcome to commenter status! :D – Stephen Lin Mar 21 '13 at 19:58
  • Thanks for the assist. ~smile~ My gut is telling me the issue is a memory boundary thing... the k_local == 4 is kind of a magic number. Is the use of inline assembler an option? If you're gonna bulk move blocks of memory, why not cut out the middle man? Sometimes hand optimization is the only route. – K Scott Piel Mar 21 '13 at 20:05
  • you're probably right, but it needs to be portable unfortunately – Stephen Lin Mar 21 '13 at 20:07
  • Is this an autotools project? If so, you can use the configure.ac to detect processor type and #ifdef your way out of it. Not pretty... but, then, when you're trying to squeeze out every CPU cycle it tends to be that way. I think the issue comes down to the optimizer has to work with what it knows at compile-time and since you can't know the leaps and bounds until runtime, there's really no way to optimize for best-case. The inline-asm, at least, would get you around making functions calls, push/pops, and a host of other issues that are going to pound performance. – K Scott Piel Mar 21 '13 at 20:13
  • hmm, it's Cython actually (a Python-like language that compiles down to C, for writing Python extension modules), but it does support conditional compilation so it could theoretically be possible to hack something up I suppose...I can't write inline assembly in Cython but I could provide it in a header and try to force inlining somehow...anyway, thanks for the tip! this could be good assembly practice. – Stephen Lin Mar 21 '13 at 20:14
  • anyway, theoretically, the compiler knows which ranges of `k` change the behavior of `memcpy` (since it's an intrinsic), so it ought to be able to factor that out, right? it's just not doing that because no one bothered to implement it, as far as I can tell...or am I missing something? – Stephen Lin Mar 21 '13 at 20:17
  • I'm not so sure about that... unless the compiler is going to link in multiple versions of memcpy() and call the "best" one for a given value of k_local, I don't see how it could do that at runtime. I'd go back to the earlier post that suggested looking at the memcpy() implementation source. The issue, in the end, is the compiler has no way of knowing, or assuming, the value of k_local at compile time... it's an unbounded runtime variable. The only hope would be for memcpy() to do something intelligent with it at runtime. (or inline asm ~grin~) – K Scott Piel Mar 21 '13 at 20:23
  • but in this particular case, it's bounded between [0, 4] before the loop, actually; and it already knows to optimize when it's bounded [4, 4]; is there really that much difference between the output along that range of values that it can't come up with an optimized version? also I'm pretty sure `memcpy()` is an intrinsic, not a linked function (at least, it can be, there might be a linked version as well for cases when the intrinsic cannot be used.) – Stephen Lin Mar 21 '13 at 20:26
  • I'm pretty sure memcpy is intrinsic, as well. But I think the thing that throws you is that though you're doing a conditional test, you're assuming the optimizer is going to consider all of the possible values and go. I don't think most will. Most will take advantage of registers and constants where they can, but few are going to try and guess what's happening at runtime. That said, I wonder if making k_local an unsigned value and a simpler conditional (k_local < 4) might be enough? – K Scott Piel Mar 21 '13 at 21:33
  • perhaps, the original cython is using signed which is why I wrote it that way, but I'll check – Stephen Lin Mar 21 '13 at 21:37
  • no dice on unsigned, I guess I'm going with a magic number lower limit for now and possibly assembly in a header later...this is like finding out santa claus doesn't exist though – Stephen Lin Mar 21 '13 at 21:39
  • ~lol~ Don't tell the Easter Bunny! – K Scott Piel Mar 21 '13 at 21:42
  • You know... in hindsight, I wonder if the optimizer isn't actually optimizing out the conditional. I bet it is. The two code blocks are identical, therefore the conditional is irrelevant and the optimizer says "I don't need that" and throws it away. – K Scott Piel Mar 21 '13 at 21:45
  • maybe for giggles; set up a switch case for 1-4 and use a constant loop --- for( register int i = 0; i < 4; i++ )... or lose the loop entirely and just do a memcpy( out, in, 4 * sizeof( double ) ); – K Scott Piel Mar 21 '13 at 21:48
  • the loop is necessary since the indexing is non-contiguous though (`indexer[i]` is randomly distributed in this case...in real world usage it's provided by the user) – Stephen Lin Mar 21 '13 at 21:54
  • True -- but in the trivial cases you're trying to optimize for, you can account for that using a constant based loop (which can be optimized) or a series of hard coded memcpy calls with no loop. Of course, this is of no value if there are "random" values for k_local that cause this issue other than the simple 0-4 case... so that's not an acceptable answer. On the other hand, if the penalty case is finite, you could code around it with a switch case and no loop. But, at the core, I think your test case is optimizing out the conditional. – K Scott Piel Mar 21 '13 at 21:58
  • yeah, but [0,4] is just to try something out though, `k` could vary arbitrarily (and this applies to other datatypes than just `double`, so the total size might not even be a multiple of 8 bytes), although it is generally the case that `k` >> `n` or `n` >> `k` (even the smaller one could be pretty large though)...let me try artificially changing one of the branches with an effective but non-optimizable no-op (i.e. adding 0.0 somewhere, since that's not technically a no-op for doubles) just to force them not to collapse... – Stephen Lin Mar 21 '13 at 22:04
  • no luck again :( even tried a non-zero number just to make sure – Stephen Lin Mar 21 '13 at 22:09
  • Daggummit... man... programming is hard. ~grin~ – K Scott Piel Mar 21 '13 at 22:11

3 Answers


Ultimately, the issue at hand is one of asking the optimizer to make assumptions about runtime behavior based on multiple variables. While it is possible to give the optimizer some compile-time hints via 'const' and 'register' declarations on the key variables, you're still depending on it to make a lot of assumptions (and, as your experiments in the comments show, those declarations made no difference here). Further, while memcpy() may well be intrinsic, it's not guaranteed to be, and even if/when it is, the implementation(s) can vary fairly widely.

If the goal is to achieve maximum performance, sometimes you just can't rely on the compiler to figure it out for you and have to do it directly. The best advice for this situation is to use inline assembler to address the problem. Doing so allows you to avoid the pitfalls of a "black box" solution built on the heuristics of the compiler and optimizer, and to state your intent explicitly. The key benefit of inline assembler is the ability to avoid pushes/pops and extraneous "generalization" code in the memory-copy path, and the ability to take direct advantage of the processor's means of solving the problem. The downside is maintenance, but given that you really only need to address Intel and AMD to cover most of the market, it's not insurmountable.
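To give a flavour of what that might look like (a sketch only, GCC-specific and x86-only, not a tuned or portable implementation), a bare "rep movsb" copy via inline assembler could be written as:

    #include <stddef.h>

    /* Sketch: naive byte copy using "rep movsb" on x86/x86-64 with GCC
     * inline assembly. Avoids the library call, but a real memcpy would
     * use wider moves and handle alignment. */
    static inline void copy_bytes_rep(void *dst, const void *src, size_t nbytes)
    {
        __asm__ __volatile__ ("rep movsb"
                              : "+D" (dst), "+S" (src), "+c" (nbytes)
                              :
                              : "memory");
    }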

I might add, too, that this solution could well allow you to take advantage of multiple cores/threads and/or a GPU if/when available to do the copying in parallel and truly get a performance gain. While the latency might be higher, the throughput would very likely be much higher, as well. If, for example, you could take advantage of a GPU when present, you could well launch one kernel per copy and copy thousands of elements in a single operation.

The alternative to this is to depend on the compiler/optimizer to make the best guesses for you, use the 'const' and 'register' declarations where you can to offer the compiler hints and use magic numbers to branch based on "best solution" paths... this, however, is going to be exceptionally compiler/system dependent and your mileage will vary widely from one platform/environment to another.

K Scott Piel
  • http://meta.stackexchange.com/questions/172955/viewing-recovering-deleted-answers-or-at-least-comments-to-those-answers-to-on – Stephen Lin Mar 22 '13 at 00:58
  • Answers aren't really for discussion, they're for answering the question posed. In this case, if a future Googler comes upon this post, how much 'discussion' will he have to wade through to get the answer? Stack Overflow is set up so that number is really low. Please edit out the discussion-y parts and distill the answer down to what really answers the question posed. – George Stocker Mar 22 '13 at 01:26
  • @KScottPiel, after George's edit, this is worded somewhat misleadingly (understandably, since George wasn't part of the discussion)...if you could edit it and clarify that `const`, `register`, and other hints to the optimizer don't actually work in this case (unfortunately), I'll accept it; otherwise (if no other answer comes along) I'll have to write one myself that is clearer – Stephen Lin Mar 22 '13 at 02:09
  • BTW, using a GPU would make the problem much slower. Copying RAM --PCIe--> GPU --PCIe--> RAM is much slower than copying RAM --> RAM. The CPU has faster access to the CPU's RAM than a GPU does. – Mr Fooz Mar 22 '13 at 12:21
  • Not true if you're dealing with large arrays. If you're going to loop through a large number of small copies, then there's a ton of overhead per copy. For the sake of argument... suppose you need to copy 2000 elements and it takes {x} time per copy to complete in the loop... that makes the copy time 2000{x} for the entire array. Conversely, you can copy the array from CPU to GPU in {a}ns, copy each element in {y} and copy the array back in {a}ns again... making the entire copy time 2{a} + {y}. Which one's faster? – K Scott Piel Mar 22 '13 at 12:35
  • add `const p` to your parameters in the prototype, or if you hate the look of that polluting your published prototype, make more local copies of the parameters and make the copies const. I'm not sure that would help the optimiser, but it's good for maintenance anyway. Your parameters of type `int` should really be `const size_t` anyway, which might just help with code quality on a 64-bit machine as the types of the variables like j will match the type required for an index. – Cecil Ward Aug 29 '16 at 02:35
  • Using `size_t` for every array index will also get rid of the unnecessary checks for `i < 0` etc, and make the multiplications unsigned, which can't hurt. These suggestions may well give you very little improvement, I'm afraid, but are good for the soul. :-) – Cecil Ward Aug 29 '16 at 02:39

SSE/AVX and Alignment

If you're on, for example, a modern-ish Intel processor then use of SSE or AVX instructions is an option. Whilst not specifically about GCC, see this. If you're interested and flush with cash, I think Intel do a version of their compiler suite for Linux as well as Windows, and I guess that comes with its own suite of libraries.

There's also this post.
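As a rough illustration only (it assumes both pointers are 16-byte aligned, which you'd have to check or guarantee), a copy kernel written with SSE2 intrinsics might look like:

    #include <emmintrin.h>

    /* Sketch: copy k doubles two at a time with aligned SSE2 moves.
     * Unaligned data would need _mm_loadu_pd/_mm_storeu_pd instead. */
    static void copy_doubles_sse2(double *dst, const double *src, int k)
    {
        int j = 0;
        for (; j + 2 <= k; j += 2)
            _mm_store_pd(&dst[j], _mm_load_pd(&src[j]));
        for (; j < k; ++j)          /* scalar tail for odd k */
            dst[j] = src[j];
    }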

Threads (eek)

I've had exactly this sort of problem fairly recently, a memcpy() taking too much time. In my instance it was one big memcpy() (1MByte or so) rather than a lot of smaller ones like you're doing.

I got very good mileage by writing my own multi-threaded memcpy() where the threads were persistent and got 'tasked' with a share of the job by a call to my own pmemcpy() function. The persistent threads meant that the overhead was pretty low. I got a 4x improvement for 4 cores.

So if it were possible to break your loops down into a sensible number of threads (I went for one per available core), and you had the luxury of a few spare cores on your machine you might get a similar benefit.
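In your case the outer loop is trivially parallel, so rather than a hand-rolled persistent-thread pool you could even try something as cheap as an OpenMP pragma; the following is just a sketch of the contiguous case (the function name is made up, build with -fopenmp):

    #include <string.h>

    /* Sketch: split the outer loop of the contiguous case across cores. */
    void take_parallel(double *out, const double *in,
                       const int *indexer, int n, int k)
    {
        int i;
        #pragma omp parallel for schedule(static)
        for (i = 0; i < n; ++i) {
            int idx = indexer[i];
            memcpy(&out[i * k], &in[idx * k], (size_t)k * sizeof(double));
        }
    }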

What the real time crowd do - DMA

Just as an aside, I have the pleasure of playing around with some fairly exotic OpenVPX hardware. Basically it's a bunch of boards in a big box with a high speed serial RapidIO interconnect between them. Each board has a DMA engine that drives data across the sRIO to another board's memory.

The vendor I went to is pretty clever at how to maximise the use of a CPU. The clever bit is that the DMA engines are pretty smart - they can be programmed to do things like matrix transformations on the fly, strip mining, things like you're trying to do, etc. And because it's a separate piece of hardware the CPU isn't tied up in the meantime, so that can be busy doing something else.

For example, if you're doing something like Synthetic Aperture Radar processing you always end up doing a big matrix transform. The beauty is that the transform itself takes no CPU time at all - you just move the data to another board and it arrives already transformed.

Anyway, having the benefit of that sort of thing really makes one wish that Intel CPUs (and others) had onboard DMA engines capable of working memory-to-memory instead of just memory-to-peripheral. That would make tasks like yours really quick.

bazza
  • this is good info about memcpy in general so I +1'd you for effort, but does this really apply in this case? the cases of large `k` are fine (well, they could be optimized more, but that's something else...); the problem is how to avoid the regression in performance for small `k` and large `n` without forcing a cutoff value? – Stephen Lin Mar 21 '13 at 22:11
  • I might resort to assembly but this library is generally released as platform-independent Python and Cython sources (the Cython is translated to C and then compiled on the user machine with its host compiler, except on Windows where it's generally prebuilt)...so knowing about the assembly instructions does help...but I'd rather get GCC to do the reasonable thing on its own if at all possible (as well as make it work on other compilers...it might not be possible but who knows?) – Stephen Lin Mar 21 '13 at 22:13
  • also, a cutoff value is ok if I'm doing this for a known processor/compiler/platform combo, but I'm trying to make it as portable as possible – Stephen Lin Mar 21 '13 at 22:18
  • ok, that will help in general and I might try it for large data sets (where either dimension is large), but I think it's kind of orthogonal to the issue at hand because I still want each individual call to `memcpy()` to be fast (or at least not be slower than a loop) or at least have some reliable (cross-platform) way of knowing whether I should call `memcpy()` or an explicit loop. – Stephen Lin Mar 21 '13 at 22:45
  • Well if you split your outer loop up between, say, 4 threads (assuming that suited the value of n reasonably well) it would help. Yes, you'd still be calling memcpy a lot, but you'd have 4 threads doing that, not one. And if n was too small but k large then you'd split the inner loop up between 4 threads. The object would be to give each thread a decent amount to do. – bazza Mar 21 '13 at 22:49
  • I still have to decide whether to call `memcpy()` at all, though? even if I parallelize, I wouldn't want to call it if it ends up being slower than just doing parallel array indexing. so it's good info that I might take into consideration, just orthogonal to this issue. – Stephen Lin Mar 21 '13 at 22:51
  • I did try writing an explicit loop that used an SSE optimised function instead of memcpy, and noticed no improvement itself. That's why I went multi-threaded, because it was clear that in my case memcpy was already quite optimal in comparison to any of the clever tricks. Sorry, I've just read your post more carefully, noticed the bit about python. Trouble is python AFAIK doesn't do threads well, it's a single interpreter thread after all. You'd have to go for multi-process instead to get the benefit. – bazza Mar 21 '13 at 22:54
  • it's within Cython, so it's possible to release the GIL and go multithreaded within a single function call if it helps; I'm not sure we actually have use cases that often where it will though, but I could be wrong – Stephen Lin Mar 21 '13 at 22:57
  • If you did parallelise I would reckon on just calling memcpy all the time rather than trying to work out what'd be best. It might not be totally optimal, but the overall speed-up should still be worthwhile. – bazza Mar 21 '13 at 23:00
  • @bazza just curious btw, when writing a multithreaded `memcpy()`, do you have to try avoiding writing to the same page and/or cache line as another thread to make it optimal? maybe it's not an issue (cache lines are pretty small?)...I just don't know hardware very well – Stephen Lin Mar 21 '13 at 23:00
  • If you could get away with using C++11 that does threads as part of the language, not as an add-on library. That'd at least be multiplatform. Does Cython allow for that sort of thing? – bazza Mar 21 '13 at 23:01
  • @bazza it might, since it's just new classes for the most part, but I don't think we can rely on the host machine having a C++11 compiler anytime soon (nor would we be willing to make it a hard dependency for installing) – Stephen Lin Mar 21 '13 at 23:02
  • In my case I had 4 threads doing a decent amount of work on non-overlapping data. That meant that within L1 and L2 cache on an i7 there was no cache / page screw ups. At the very boundaries of each thread's part there could conceivably be L3 foul ups. But then there's only 3 boundaries between the four threads, so even if they happened it would be negligible in comparison to the overall task. – bazza Mar 21 '13 at 23:05
  • Hmmm, looks like C++11 isn't really available in practice. Boost is pretty pervasive though, works in gcc and MS land, does threads. That ought to allow you to have the same source code everywhere. – bazza Mar 21 '13 at 23:11
  • Just going back to whether or not to use memcpy() - as your data in the general case mightn't be contiguous I'd first write it not to use memcpy() at all. If it were me I'd adapt your first example so that the outer loop was shared between some threads, and try that. If I liked that then I'd write an alternate where the inner loop was split up between threads. Thing is that alternate would only ever be useful if n was, say, 2 or 1. If it were 3 or more then 3 or 4 threads doing part of the outer loop should be pretty good still even if n mod 4 != 0. Whatever your choice, good luck :) – bazza Mar 21 '13 at 23:19
  • @bazza thanks, it's definitely contiguous often enough though for this to help, but I'll look into parallelizing anyway at some point – Stephen Lin Mar 21 '13 at 23:23

I think the best way is to experiment and find out the optimal "k" value at which to switch between the original algorithm (with a loop) and your optimized algorithm using memcpy. The optimal "k" will vary across different CPUs, but shouldn't be drastically different; essentially it comes down to the overhead of calling memcpy, and the overhead within memcpy itself of choosing the optimal algorithm (based on size, alignment, etc.), vs. the "naive" algorithm with a loop.
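For instance, a minimal timing harness along these lines would let you locate the crossover on a given machine (take_variant stands in for a contiguous-only version of either implementation, with the strides dropped from the signature; filling out/in/indexer is assumed to happen elsewhere):

    #include <time.h>

    /* Sketch: time `reps` runs of one variant for a given k, in seconds.
     * May need -lrt on older glibc for clock_gettime(). */
    static double time_variant(void (*take_variant)(double *, double *,
                                                    int *, int, int),
                               double *out, double *in, int *indexer,
                               int n, int k, int reps)
    {
        struct timespec t0, t1;
        int r;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (r = 0; r < reps; ++r)
            take_variant(out, in, indexer, n, k);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }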

memcpy is an intrinsic in gcc, yes, but it doesn't do magic. Basically, if the size argument is known at compile time and is small-ish (I don't know what the threshold is), then GCC will replace the call to the memcpy function with inline code. If the size argument is not known at compile time, a call to the library function memcpy will always be made.
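To make that concrete: one way to hand the intrinsic a compile-time-constant size for a handful of small k values is to unswitch by hand, along these lines (the macro and function names are made up for illustration; each case lets GCC expand the copy inline, with a variable-size fallback for larger k):

    #include <string.h>

    /* Each expansion gives memcpy() a constant size, so GCC can inline it. */
    #define COPY_ROWS_FIXED(K)                                  \
        do {                                                    \
            int i_;                                             \
            for (i_ = 0; i_ < n; ++i_) {                        \
                int idx_ = indexer[i_];                         \
                memcpy(&out[i_ * (K)], &in[idx_ * (K)],         \
                       (K) * sizeof(double));                   \
            }                                                   \
        } while (0)

    void take_contig(double *out, const double *in,
                     const int *indexer, int n, int k)
    {
        switch (k) {
        case 1: COPY_ROWS_FIXED(1); break;
        case 2: COPY_ROWS_FIXED(2); break;
        case 3: COPY_ROWS_FIXED(3); break;
        case 4: COPY_ROWS_FIXED(4); break;
        default: {                     /* variable-size fallback */
            int i;
            for (i = 0; i < n; ++i) {
                int idx = indexer[i];
                memcpy(&out[i * k], &in[idx * k], k * sizeof(double));
            }
            break;
        }
        }
    }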

janneb
  • +1, but I'm not asking for "magic": this is just loop unswitching with a function body completely known by the compiler! the compiler has all the information it needs to know that `k` is bounded between [0, 4] and that it can pick a "small array" version in my last version...very frustrating that it can't use that information... – Stephen Lin Mar 22 '13 at 16:22
  • @StephenLin: I think the problem is that such an optimization might bloat the code size too much to be worth doing in the general case. – janneb Mar 23 '13 at 21:23
  • I understand, but in this case I've already chosen to bloat the code by doubling it, the compiler just isn't taking the hint – Stephen Lin Mar 23 '13 at 22:07
  • What optimisation-related switches are you using with the compiler? e.g. `-O2` vs `-O3` and `-ffast-math` (dangerous, slightly). – Cecil Ward Aug 29 '16 at 02:45