
I'm writing a program (in C++) in which I need to allocate arrays whose starting addresses should be aligned with the cache line size. When I allocate these arrays I also want the memory initialized to zero.

Right now I have it working using the posix_memalign function. This works well for getting memory-aligned arrays, but the arrays are uninitialized. Is there a better function I can use to zero out the arrays when I allocate them, or do I just have to settle for writing a separate loop to do it myself?

martega

2 Answers


Just call memset on the block. Make sure you don't cast the pointer to a type that's expensive to set (like char *) before calling memset. Since your pointer will be aligned, make sure that information isn't hidden from the compiler.

Update: To clarify my point about not hiding alignment, compare:

char* mem_demo_1(char *j)
{ // *BAD* compiler cannot tell pointer alignment, must test
    memset(j, 0, 64);
    return j;
}

char* mem_demo_2(void)
{ // *GOOD* compiler can tell pointer alignment
    char * j = malloc(64);
    memset(j, 0, 64);
    return j;
}

With GCC, mem_demo_1 compiles to 60 lines of assembly while mem_demo_2 compiles to 20. The performance difference is also huge.

David Schwartz
  • Could you please explain `Make sure you don't cast the pointer to a type that's expensive to set (like char *) before calling memset`? –  Dec 17 '12 at 05:08
  • @skwllsp I think he means that `char` is too small. – atoMerz Dec 17 '12 at 05:40
  • Thanks! What's wrong with using memset to clear a character array? What makes certain types more expensive than others? – martega Dec 17 '12 at 06:55
  • @martega: If you pass a `char *` to `memset`, the compiler cannot make any assumptions about alignment. If you pass a `long *` to `memset`, the compiler can assume the memory block is aligned on a `long` boundary and that makes the `memset` *much* more efficient. – David Schwartz Dec 17 '12 at 07:47
  • @David Schwartz. Please take a look at my answer. I would appreciate if you commented it. –  Dec 19 '12 at 08:01
  • You managed to make both cases even worse than my worst case! In your answer, they both require a jump to the generic `memset` that makes no assumptions about alignment. So not only did you get the worst-case `memset`, you get an extra jump/return too! – David Schwartz Dec 19 '12 at 08:06
  • @David Schwartz Actually I also built it with enabled built-in memset. See in my post. –  Dec 19 '12 at 08:13
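If the allocation and the memset can't live in the same function, one way to hand the alignment fact back to the compiler is `__builtin_assume_aligned`. This is a sketch of that idea, not something from the answer above, and it assumes a reasonably recent GCC or Clang (the builtin appeared in GCC 4.7):

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: re-assert the alignment that the char* type no longer
   carries, so the compiler is free to emit aligned stores. */
void zero_line(char *j)
{
    char *p = __builtin_assume_aligned(j, 64);
    memset(p, 0, 64);
}
```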

With GCC, mem_demo_1 compiles to 60 lines of assembly while mem_demo_2 compiles to 20. The performance difference is also huge.

I decided to verify this statement on Linux 2.6.32 with gcc 4.4.6.

First, the claim that "mem_demo_1 compiles to 60 lines of assembly while mem_demo_2 compiles to 20".

This is the test (in file main.c):

  #include <stdlib.h>
  #include <stdio.h>
  #include <string.h>

  char* mem_demo_1(char *j)
  {
      // *BAD* compiler cannot tell pointer alignment, must test
      memset(j, 0, 64);
      return j;
  }

  char* mem_demo_2(void)
  {
    // *GOOD* compiler can tell pointer alignment
    char * j = malloc(64);
    memset(j, 0, 64);
    return j;
  }

  int main()
  {
    char *p;
    p = malloc(64);
    p = mem_demo_1(p);
    printf ("%p\n",p);
    free (p);

    p = mem_demo_2();
    printf ("%p\n",p);
    free (p);

    return 0;
  }

When I compile:

  gcc -fno-inline -fno-builtin -m64 -g -O2 main.c -o main.no_inline_no_builtin  

I see that there are only 8 lines in mem_demo_1:

(gdb) disassemble mem_demo_1
Dump of assembler code for function mem_demo_1:
   0x00000000004005d0 <+0>:     push   %rbx
   0x00000000004005d1 <+1>:     mov    $0x40,%edx
   0x00000000004005d6 <+6>:     mov    %rdi,%rbx
   0x00000000004005d9 <+9>:     xor    %esi,%esi
   0x00000000004005db <+11>:    callq  0x400470 <memset@plt>
   0x00000000004005e0 <+16>:    mov    %rbx,%rax
   0x00000000004005e3 <+19>:    pop    %rbx
   0x00000000004005e4 <+20>:    retq
End of assembler dump.

I see that there are only 11 lines in mem_demo_2:

(gdb) disassemble mem_demo_2
Dump of assembler code for function mem_demo_2:
   0x00000000004005a0 <+0>:     push   %rbx
   0x00000000004005a1 <+1>:     mov    $0x40,%edi
   0x00000000004005a6 <+6>:     callq  0x400480 <malloc@plt>
   0x00000000004005ab <+11>:    mov    $0x40,%edx
   0x00000000004005b0 <+16>:    mov    %rax,%rbx
   0x00000000004005b3 <+19>:    xor    %esi,%esi
   0x00000000004005b5 <+21>:    mov    %rax,%rdi
   0x00000000004005b8 <+24>:    callq  0x400470 <memset@plt>
   0x00000000004005bd <+29>:    mov    %rbx,%rax
   0x00000000004005c0 <+32>:    pop    %rbx
   0x00000000004005c1 <+33>:    retq
End of assembler dump.

So, "mem_demo_1 compiles to 60 lines of assembly while mem_demo_2 compiles to 20" can't be confirmed.

When I compile:

  gcc -m64 -g -O2 main.c -o main.default

gcc uses its own built-in implementation of memset, and both functions mem_demo_1 and mem_demo_2 get bigger:

mem_demo_1: 43 instructions
mem_demo_2: 48 instructions

However, "mem_demo_1 compiles to 60 lines of assembly while mem_demo_2 compiles to 20" still can't be confirmed.

Second

"The performance difference is also huge"

I extended main.c to run memset in a loop many times. I also don't see that the memset in mem_demo_1 is slower than in mem_demo_2. These are from Linux perf reports.

mem_demo_2 spends 8.37% in memset:

8.37% main.perf.no_bu libc-2.12.so [.] __memset_sse2

while mem_demo_1 spends 7.61% in memset:

7.61% main.perf.no_bu libc-2.12.so [.] __memset_sse2

And these are the measurements themselves:

# time ./main.perf.no_builtin_no_inline 100000000 1 0
number loops 100000000
mem_demo_1

real    0m3.483s
user    0m3.481s
sys     0m0.002s

# time ./main.perf.no_builtin_no_inline 100000000 2 0
number loops 100000000
mem_demo_2

real    0m3.503s
user    0m3.501s
sys     0m0.001s

By the way, this is how gcc -fverbose-asm -c -S -O3 shows the assembler for mem_demo_2:

char* mem_demo_2(void)
{
  char * j = malloc(64);
  memset(j, 0, 64);
  return j;
}

        .file   "main.mem_demo_2.c"
# GNU C (GCC) version 4.4.6 20110731 (Red Hat 4.4.6-3) (x86_64-redhat-linux)
#       compiled by GNU C version 4.4.6 20110731 (Red Hat 4.4.6-3), GMP version 4.3.1, MPFR version 2.4.1.
# GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
# options passed:  main.mem_demo_2.c -m64 -mtune=generic -auxbase-strip
# main.mem_demo_2.default.asm -g -O3 -fverbose-asm
# options enabled:  -falign-loops -fargument-alias
# -fasynchronous-unwind-tables -fauto-inc-dec -fbranch-count-reg
# -fcaller-saves -fcommon -fcprop-registers -fcrossjumping
# -fcse-follow-jumps -fdefer-pop -fdelete-null-pointer-checks
# -fdwarf2-cfi-asm -fearly-inlining -feliminate-unused-debug-types
# -fexpensive-optimizations -fforward-propagate -ffunction-cse -fgcse
# -fgcse-after-reload -fgcse-lm -fguess-branch-probability -fident
# -fif-conversion -fif-conversion2 -findirect-inlining -finline
# -finline-functions -finline-functions-called-once
# -finline-small-functions -fipa-cp -fipa-cp-clone -fipa-pure-const
# -fipa-reference -fira-share-save-slots -fira-share-spill-slots -fivopts
# -fkeep-static-consts -fleading-underscore -fmath-errno -fmerge-constants
# -fmerge-debug-strings -fmove-loop-invariants -fomit-frame-pointer
# -foptimize-register-move -foptimize-sibling-calls -fpeephole -fpeephole2
# -fpredictive-commoning -freg-struct-return -fregmove -freorder-blocks
# -freorder-functions -frerun-cse-after-loop -fsched-interblock
# -fsched-spec -fsched-stalled-insns-dep -fschedule-insns2 -fsigned-zeros
# -fsplit-ivs-in-unroller -fsplit-wide-types -fstrict-aliasing
# -fstrict-overflow -fthread-jumps -ftoplevel-reorder -ftrapping-math
# -ftree-builtin-call-dce -ftree-ccp -ftree-ch -ftree-coalesce-vars
# -ftree-copy-prop -ftree-copyrename -ftree-cselim -ftree-dce
# -ftree-dominator-opts -ftree-dse -ftree-fre -ftree-loop-im
# -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
# -ftree-pre -ftree-reassoc -ftree-scev-cprop -ftree-sink -ftree-sra
# -ftree-switch-conversion -ftree-ter -ftree-vect-loop-version
# -ftree-vectorize -ftree-vrp -funit-at-a-time -funswitch-loops
# -funwind-tables -fvar-tracking -fvar-tracking-assignments
# -fvect-cost-model -fverbose-asm -fzero-initialized-in-bss
# -m128bit-long-double -m64 -m80387 -maccumulate-outgoing-args
# -malign-stringops -mfancy-math-387 -mfp-ret-in-387 -mfused-madd -mglibc
# -mieee-fp -mmmx -mno-sse4 -mpush-args -mred-zone -msse -msse2
# -mtls-direct-seg-refs
mem_demo_2:
.LFB30:
        .file 1 "main.mem_demo_2.c"
        .loc 1 6 0
        .cfi_startproc
        subq    $8, %rsp
        .cfi_def_cfa_offset 16
        .loc 1 7 0
        movl    $64, %edi
        call    malloc
        .loc 1 8 0
        testb   $1, %al
        .loc 1 7 0
        movq    %rax, %rsi
.LVL0:
        .loc 1 8 0
        movq    %rax, %rdi
        movl    $64, %edx
        jne     .L10
        testb   $2, %dil
        jne     .L11
.L3:
        testb   $4, %dil
        jne     .L12
.L4:
        movl    %edx, %ecx
        xorl    %eax, %eax
.LVL1:
        shrl    $3, %ecx
        testb   $4, %dl
        mov     %ecx, %ecx
        rep stosq
        je      .L5
        movl    $0, (%rdi)
        addq    $4, %rdi
.L5:
        testb   $2, %dl
        je      .L6
        movw    $0, (%rdi)
        addq    $2, %rdi
.L6:
        andl    $1, %edx
        je      .L7
        movb    $0, (%rdi)
.L7:
        .loc 1 10 0
        movq    %rsi, %rax
        addq    $8, %rsp
        .cfi_remember_state
        .cfi_def_cfa_offset 8
        ret
        .p2align 4,,10
        .p2align 3
.L10:
        .cfi_restore_state
        .loc 1 8 0
        leaq    1(%rax), %rdi
        movb    $0, (%rax)
        movb    $63, %dl
        testb   $2, %dil
        je      .L3
        .p2align 4,,10
        .p2align 3
.L11:
        movw    $0, (%rdi)
        addq    $2, %rdi
        subl    $2, %edx
        testb   $4, %dil
        je      .L4
        .p2align 4,,10
        .p2align 3
.L12:
        movl    $0, (%rdi)
        subl    $4, %edx
        addq    $4, %rdi
        jmp     .L4
        .cfi_endproc
  • Why did you tell it not to inline? The whole point was to measure `memset` performance and you specifically told it not to optimize `memset`. Yeah, with that, they'll both perform badly. They both include a jump to the generic `memset` which makes no assumptions about pointer alignment. The point was to try to get *good* code in at least one case; you got bad code in both. – David Schwartz Dec 19 '12 at 08:06
  • @David Schwartz I also did it with inlining enabled. Please see this in my post: `gcc -m64 -g -O2 main.c -o main.default` –  Dec 19 '12 at 08:10
  • I'm not sure why you're seeing different results. I pasted some more details about how I got my results [online](http://pastebin.com/bUd9RVaT). – David Schwartz Dec 19 '12 at 12:59
  • @David Schwartz Updated my answer - added assembler for mem_demo_2. It is bigger that yours. –  Dec 19 '12 at 14:39
  • Hmm, somehow your compiler failed to optimize `mem_demo_2`. Maybe your older version of GCC doesn't know how to. However, `mem_demo_1` doesn't even give it a chance. – David Schwartz Dec 19 '12 at 15:10
  • I compiled the same program with MinGW gcc 4.6.2 on Windows XP. When I compile with `gcc -O3 -g main.c -o main` I don't see any difference between the functions. When I compile with `gcc -march=native -O3 -g main.c -o main.native` I get the difference in the number of lines that you are talking about. So, there is no difference with `-march=i386` and a big difference with `-march=core2`. –  Dec 19 '12 at 16:35
  • The point is really not so much to get or not get a particular optimization, but to not hide information from the compiler and to give it the best chance of making whatever optimizations it can. The code was really just intended as an example. (Imagine if we had this conversation before the core2 came out. You might think it makes no difference at all.) – David Schwartz Dec 19 '12 at 17:45