
I have a very frustrating problem. My application has run flawlessly on a few machines for a month. However, there is one machine on which it crashes nearly every day with a segfault. It always crashes at the same instruction address:

segfault at 7fec33ef36a8 ip 000000000041c16d sp 00007fec50a55c80 error 6 in myapp[400000+f8000]

This address points to a memcpy call.

Below is excerpt #1 from my app:

....
uint32_t size = messageSize - sizeof(uint64_t) + 1;

stack->trcData = (char*)Realloc(stack->trcData,(stack->trcSize + size + sizeof(uint32_t)));
char* buffer = stack->trcData + stack->trcSize;

uint32_t n_size = htonl(size);
memcpy(buffer,&n_size,sizeof(uint32_t)); /* ip 000000000041c16d points here*/
buffer += sizeof(uint32_t);

....
stack->trcSize += size + sizeof(uint32_t);
....

where stack is a structure:

struct Stack{
  char*     trcData;    
  uint32_t  trcSize;    
  /* ... some other elements */
};

and Realloc is a realloc wrapper:

#define Realloc(x,y)    _Realloc((x),(y),__LINE__)

void* _Realloc(void* ptr,size_t size,int line){

  void *tmp = realloc(ptr,size);
  if(tmp == NULL){
    fprintf(stderr,"R%i: Out of memory: trying to allocate: %zu.\n",line,size);
    exit(EXIT_FAILURE);
  }
  return tmp;
}

messageSize is of type uint32_t and its value is always greater than 44 bytes. Code #1 runs in a loop. stack->trcData is just a buffer that collects data until some condition is fulfilled. stack->trcData is always initialized to NULL. The application is compiled with gcc with -O3 optimization enabled. When I ran it in gdb, of course it did not crash, as I expected ;)

I have run out of ideas as to why myapp crashes during the memcpy call. Realloc returns with no error, so I assume it allocated enough space and that I can write to this area. Valgrind, run as

valgrind --leak-check=full --track-origins=yes --show-reachable=yes myapp

shows absolutely no invalid reads/writes.

Is it possible that the memory itself is faulty on this particular machine and that this causes these frequent crashes? Or maybe I corrupt memory somewhere else in myapp, but if that is the case, why does it not crash earlier, when the invalid write is made?

Thanks in advance for any help.

Assembly piece:

41c164: 00 
41c165: 48 01 d0                add    %rdx,%rax
41c168: 44 89 ea                mov    %r13d,%edx
41c16b: 0f ca                   bswap  %edx
41c16d: 89 10                   mov    %edx,(%rax)
41c16f: 0f b6 94 24 47 10 00    movzbl 0x1047(%rsp),%edx
41c176: 00

I'm not sure whether this information is relevant, but all the machines my application runs on successfully have Intel processors, whilst the one causing the problem has an AMD processor.

  • How/where do you set `stack->trcData` initially? How/where is `messageSize` set? Your segfault could be due to a memory management bug in your code, but you don't have enough pieces here to determine that. – lurker Sep 03 '13 at 12:22
  • I wouldn't rule out faulty hardware. Have your system administrators run a heavy duty memory test on the computer where your code crashes, and see if they could tell you anything interesting. – Sergey Kalinichenko Sep 03 '13 at 12:23
  • @mbratch `stack->trcData` is set to `NULL` initially. A value is assigned to `messageSize` and it's always checked. –  Sep 03 '13 at 12:25
  • @dasblinkenlight He plans to run a memory test but not very soon. –  Sep 03 '13 at 12:28
  • @DariuszSendkowski: what does the disassembly in that area look like? I don't see why a call to `memcpy` would crash at the call site. – nneonneo Sep 03 '13 at 12:32
  • Then you shouldn't plan to provide a fix "very soon" either - it's a good idea to ensure that you aren't embarking on a wild goose chase before you begin. If valgrind says you're good, the search would be very costly. – Sergey Kalinichenko Sep 03 '13 at 12:33
  • In your code #1 `Realloc()` is called with only two instead of three parameters. Is that the case in the original code as well? – Ingo Leonhardt Sep 03 '13 at 12:39
  • @Ingo Leonhardt Sorry, `Realloc` is a macro. I've just edited my post. –  Sep 03 '13 at 12:43
  • Are you sure you have a prototype of `void *_Realloc()` in your code? Thanks to the cast you have made, the code would compile without one as well. But on some 64-bit architectures you would then store only the last four bytes of the eight-byte address in `stack->trcData`. – Ingo Leonhardt Sep 03 '13 at 12:49
  • @Ingo Leonhardt Yes, I have the prototype of `_Realloc` in the code. –  Sep 03 '13 at 12:52
  • @Ernest Friedman-Hill I call `htonl` since `stack->trcData` is sent to another application over network eventually. On the other side of communication, the size is decoded by calling `ntohl`. –  Sep 03 '13 at 12:56
  • Is it possible that at some point `messageSize` has a value that makes `stack->trcSize + size + sizeof(uint32_t) = 0`, so that realloc returns NULL? (For example with `messageSize = 4` and `trcSize = 0`, if my calculations are correct...) – NoWiS Sep 03 '13 at 14:29
  • Have you tried monitoring the code on other machines to make sure the code at this location is executed OK elsewhere? Do any other applications crash on the machine where this one does? If none of the other machines running this code actually execute it, then it doesn't necessarily point to the hardware; if all the other machines do execute this same code flawlessly, then it supports the 'machine at fault' contention. If other applications are failing on the same machine for a similar reason, that supports 'machine at fault'; if no other application runs into the problem, maybe not. – Jonathan Leffler Sep 03 '13 at 14:37
  • @Jonathan Leffler This piece of code is one of the most frequently called pieces in the whole application. This problem occurs only on a single, particular machine. –  Sep 03 '13 at 15:03
  • @NoWiS No, it is not possible. `messageSize` is always greater than 44 bytes. Its value is always checked before `Realloc` call. –  Sep 03 '13 at 15:05
  • What are the values of `stack->trcData` before and after the `Realloc()` call in an instance where it crashes? What is the value of the `rax` register when it crashes? What are all of the regions of memory mapped into your program when it crashes (`cat /proc//maps`)? – Adam Rosenfield Sep 03 '13 at 23:01
  • It might be worth trying either an alternate malloc implementation (e.g. TC malloc) or see if your existing malloc has any diagnostics that might uncover problems: http://www.gnu.org/software/libc/manual/html_node/Heap-Consistency-Checking.html – Drew MacInnis Sep 04 '13 at 02:59
  • To answer your question about an invalid write elsewhere in the program - it absolutely can lead to this, and the reason it does not crash is that it's writing to a valid location in memory, just the wrong location. Seg faults are caught by the kernel when the hardware tells the kernel the process accessed a memory location for which its memory map does not have an entry. A memory checker could help; glibc has one built in. – ash Sep 04 '13 at 06:29
  • To enable glibc's checker, set the environment variable `MALLOC_CHECK_` to 1 (errors go to stderr), 2 (error calls abort()), or 3 (error is printed to stderr and calls abort()). – ash Sep 04 '13 at 06:32
  • Is stack->trcSize appropriately updated elsewhere in the code? – Claudi Sep 08 '13 at 14:52
  • @ash Enabling `MALLOC_CHECK_` gave no extra information. The application crashed exactly the same as before. –  Sep 08 '13 at 16:39
  • @Claudix The size is updated within the same block. –  Sep 08 '13 at 16:42
  • @DariuszSendkowski - ah, that means that memory allocation did not detect the error. Perhaps it's not related to malloc and free... – ash Sep 09 '13 at 04:54
  • Have you tried your own version of memcpy, i.e., just copying byte-by-byte in a loop? That is only to rule out a possible `memcpy` malfunction. Even better, replace the memcpy line with this statement: `*((uint32_t*)buffer) = htonl(size)` – Claudi Sep 09 '13 at 07:34
  • I think I know what can cause this situation. Suppose that at some loop step `stack->trcSize + size` exceeds `UINT32_MAX`. That means `Realloc` in fact shrinks `stack->trcData`. Next, I define `buffer`, which now points far beyond the allocated area. Hence, when I write to `buffer` I get a segfault. What do you think? –  Sep 11 '13 at 11:00

1 Answer


Here is the cause of my problem. At some loop step, stack->trcSize + size exceeds UINT32_MAX and wraps around, so Realloc in fact shrinks stack->trcData. buffer, which is computed from the old stack->trcSize, then points far beyond the allocated area. Hence, writing to buffer causes the segfault. I've checked it and it was indeed the cause.
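To make the wraparound concrete: with stack->trcSize = 4294967000 and size = 400, the 32-bit sum is 4294967400 mod 2^32 = 104, so realloc is asked for only 104 + 4 = 108 bytes while buffer still sits at offset 4294967000. Below is a minimal sketch of one way to guard against this, not the actual fix used in myapp. The helper names trc_would_overflow and trc_append are hypothetical, and the payload parameter is an assumption, since the question elides how the message body is copied. The sum is done in 64 bits so that it cannot wrap before it is checked.

```c
#include <arpa/inet.h>  /* htonl */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct Stack {
    char     *trcData;
    uint32_t  trcSize;
};

/* Hypothetical helper: returns 1 if trcSize + size + 4 no longer fits in
 * a uint32_t, 0 otherwise. The sum is widened to 64 bits before adding. */
int trc_would_overflow(uint32_t trcSize, uint32_t size)
{
    uint64_t total = (uint64_t)trcSize + size + sizeof(uint32_t);
    return total > UINT32_MAX;
}

/* Hypothetical helper: appends a 4-byte big-endian length followed by
 * `size` payload bytes (the payload copy is an assumption).
 * Returns 0 on success, -1 on overflow or allocation failure; on failure
 * the existing buffer and trcSize are left untouched. */
int trc_append(struct Stack *stack, const void *payload, uint32_t size)
{
    if (trc_would_overflow(stack->trcSize, size))
        return -1;

    size_t newSize = (size_t)stack->trcSize + size + sizeof(uint32_t);
    char *tmp = realloc(stack->trcData, newSize);
    if (tmp == NULL)
        return -1;
    stack->trcData = tmp;

    /* Safe now: trcSize is known to lie inside the new allocation. */
    char *buffer = stack->trcData + stack->trcSize;
    uint32_t n_size = htonl(size);
    memcpy(buffer, &n_size, sizeof n_size);
    memcpy(buffer + sizeof n_size, payload, size);

    stack->trcSize = (uint32_t)newSize;
    return 0;
}
```

A caller could treat a -1 return as the signal to flush the collected data before appending more. Widening trcSize to size_t would be another option, as long as the protocol on the receiving side allows it.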