Edit: I want to test the system by inserting a breakpoint and comparing memory before and after the breakpoint.
I used static analysis to get a list of C source code locations, and the debugging information (i.e., DWARF) provides a mapping between C source code and machine instructions in the executable.
The problem is that many machine instructions map to one line of C source code, and I would need to test all of them.
The machine instructions I need to test are the ones that modify the memory state, so I want to reduce the number of instructions by eliminating those that don't modify memory.

For example, I have the following source code test.c and I have the line number 5.

2   int var1 = 10;
3   void foo() {
4       int *var2 = (int*)malloc(sizeof(int));
5       for(*var2=var1;;) {
6       /* ... */
7       }
8   }

To be clear, line number 5 accesses the global memory var1 and the heap memory *var2.

I compiled the above program with the command gcc -g test.c and the result is

(a.out)
00000000004004d6 <foo>:
  4004d6:   55                      push   %rbp
  4004d7:   48 89 e5                mov    %rsp,%rbp
  4004da:   48 83 ec 10             sub    $0x10,%rsp
  4004de:   bf 04 00 00 00          mov    $0x4,%edi
  4004e3:   e8 d8 fe ff ff          callq  4003c0 <malloc@plt>
  4004e8:   48 89 45 f8             mov    %rax,-0x8(%rbp)
  4004ec:   8b 15 1e 04 20 00       mov    0x20041e(%rip),%edx        # 600910 <var1>
  4004f2:   48 8b 45 f8             mov    -0x8(%rbp),%rax
  4004f6:   89 10                   mov    %edx,(%rax)
  4004f8:   eb fe                   jmp    4004f8 <foo+0x22>

and dwarfdump -l a.out gives me the following result.

0x004004d6  [   3, 0] NS uri: "/home/workspace/test.c"
0x004004de  [   4, 0] NS
0x004004ec  [   5, 0] NS
0x004004f8  [   5, 0] DI=0x1

Now I know that, in a.out, the locations 0x4004ec, 0x4004f2, 0x4004f6 and 0x4004f8 are mapped to line number 5 in the C source code.
But I want to exclude 0x4004f8 (jmp), which doesn't access (heap, global or local) memory.

Does anyone know how to get only instructions that access memory?

Dae R. Jeong
  • 1
    Do you want to include implicit memory operands like the stack for push/pop and call/ret? (And also `rep movs` or `rep stos`, which gcc will inline sometimes.) Intel-syntax disassembly might be handy, because all explicit memory operands will have `ptr` in them, like `mov rax, qword ptr [rbp - 0x8]`, so you can text search. – Peter Cordes Oct 22 '17 at 07:07
  • @PeterCordes I don't want to include implicit memory operands. I think your answer is exactly what I want to know. Especially, I was not sure that all memory operands have `ptr`. Thanks a lot!! – Dae R. Jeong Oct 22 '17 at 07:25
  • 2
    In asm source, the ` ptr` syntax isn't required, but disassemblers like `objdump -drwC -Mintel` are explicit. In AT&T syntax, you could also just look for `()` or a bare symbol name as an operand. Oh, don't forget to filter out `lea` instructions. `lea` is like the `&` operator in C. It's a shift-and-add instruction that uses memory-operand syntax and machine encoding. – Peter Cordes Oct 22 '17 at 07:27
  • 1
    **Why** do you ask? Please **edit your question** to motivate it and improve it! – Basile Starynkevitch Oct 22 '17 at 07:58
  • @AnttiHaapala: indirect jumps with a register source only reference memory for code-fetch, same as direct jumps and calls. Memory-indirect jumps use `*0x1234(%reg)` or `qword ptr []`, so my suggestion to search for `()` or `ptr` covers that. – Peter Cordes Oct 22 '17 at 08:03
  • 1
    Smells badly like some [XY problem](http://xyproblem.info). You would get much better answers if you motivated your question and explained your overall goals. – Basile Starynkevitch Oct 22 '17 at 08:10
  • @PeterCordes ah true that :D I need my morning coffee. – Antti Haapala -- Слава Україні Oct 22 '17 at 08:11
  • Without edits (giving additional motivations), the question is unclear, and I voted to close it. – Basile Starynkevitch Oct 22 '17 at 08:19
  • @BasileStarynkevitch Sorry for the late response. I don't speak English... I edited the question. Still having a problem? – Dae R. Jeong Oct 22 '17 at 08:48
  • Yes, still too broad question. In particular, how much time can you afford spending (several years full-time for a PhD level work)? Do you want some general solution, or are you focussing only on debugging one particular program (then consider `gdb` watchpoints and scripting). BTW I am not a native English speaker (I'm French). I still don't understand your real goals! – Basile Starynkevitch Oct 22 '17 at 08:51
  • Is your work a PhD on compilation or static analysis, or are you just facing a difficult debugging issue on some concrete but specific C program? – Basile Starynkevitch Oct 22 '17 at 08:55
  • I am impatiently waiting for an additional motivation in your question. – Basile Starynkevitch Oct 22 '17 at 09:05
  • Your question remains unclear, and I even asked a [meta-question](https://meta.stackoverflow.com/q/358255/841108) related to it. – Basile Starynkevitch Oct 22 '17 at 09:46
  • But do you seek a particular solution to a given obscure bug (which you should have explained in your question, but did not) in some particular code base, or do you want a more ambitious and more general (and reusable) solution? – Basile Starynkevitch Oct 22 '17 at 09:59
  • 1
    I'm in PhD course, but I'm just facing a difficult debugging issue. – Dae R. Jeong Oct 22 '17 at 10:01
  • That debugging issue should go into and be detailed in your question. Probably you just need `gdb` watchpoints or [valgrind](http://valgrind.org/) or address sanitizers, and you don't really need to care about actual machine instructions doing memory accesses. So you have a strong typical [XY problem](http://xyproblem.info) – Basile Starynkevitch Oct 22 '17 at 10:03

2 Answers

This is only answering the question about finding asm instructions with explicit memory operands. The part about associating them with C statements is pretty bogus outside of -O0 compiler output (where each statement is compiled to a separate block of instructions to support GDB's jump to another line in the same function, or modifying variables in memory while stopped at breakpoint). See Basile's answer which tries to make some sense of the C statement stuff in the question.


Intel-syntax disassembly might be handy, because all explicit memory operands will have ptr in them, like mov rax, qword ptr [rbp - 0x8], so you can text search.

In asm source, the <size> ptr syntax isn't required when a register operand implies the operand size, but disassemblers like objdump -drwC -Mintel always put it in.

In AT&T syntax, you could also just look for () or a bare symbol name as an operand.

Don't forget to filter out lea instructions. lea is like the & operator in C. It's a shift-and-add instruction that uses memory-operand syntax and machine encoding.

Also don't forget to filter out various long-nop instructions that use addressing modes to get the right amount of padding in one instruction. For example:

66 2e 0f 1f 84 00 00 00 00 00   nop    WORD PTR cs:[rax+rax*1+0x0]

So if the mnemonic is lea or nop, ignore the instruction. (32-bit code sometimes uses other instructions as NOPs, but usually it's actually an lea that sets a register to itself in machine code generated by gas / ld from compiler .p2align directives.)
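The text search described above can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions: it expects raw `objdump -drwC -Mintel` output with its usual tab-separated columns, the names `explicit_memory_insns`, `ADDRESS_ONLY`, `PREFIXES` and `SAMPLE` are made up for illustration, and the sample lines are a hand-translated Intel-syntax version of part of the listing in the question. The match is case-insensitive because objdump prints `PTR` in upper case.

```python
import re

# Mnemonics that use memory-operand syntax without touching memory.
ADDRESS_ONLY = {"lea", "nop"}
# Prefixes that objdump prints in front of the real mnemonic.
PREFIXES = {"rep", "repz", "repe", "repnz", "repne", "lock"}


def explicit_memory_insns(disassembly):
    """Return [(address, text)] for instructions with an explicit memory operand.

    Assumes each instruction line looks like "address:<TAB>bytes<TAB>instruction".
    """
    hits = []
    for line in disassembly.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # section header, label, blank or byte-continuation line
        text = fields[2].split("#")[0].strip()  # drop objdump's trailing comment
        words = text.split()
        while words and words[0] in PREFIXES:
            words = words[1:]
        if not words or words[0] in ADDRESS_ONLY:
            continue  # lea and (long) nop never access memory
        if re.search(r"\bptr\s", text, re.IGNORECASE):
            address = int(fields[0].strip().rstrip(":"), 16)
            hits.append((address, text))
    return hits


# Hand-translated sample (assumption), not real objdump output.
SAMPLE = "\n".join([
    "  4004e8:\t48 89 45 f8          \tmov    QWORD PTR [rbp-0x8],rax",
    "  4004f6:\t89 10                \tmov    DWORD PTR [rax],edx",
    "  4004f8:\teb fe                \tjmp    4004f8 <foo+0x22>",
])

for address, text in explicit_memory_insns(SAMPLE):
    print(hex(address), text)
```

One known limitation of this sketch: a demangled symbol annotation that happens to contain a standalone word `ptr` followed by a space would be a false positive, so stripping the `<...>` annotations first may be worthwhile for C++ code.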


objdump disassembles rep stos with explicit operands, like rep stos QWORD PTR es:[rdi],rax. So you will actually get rep movs and rep stos operands. (Note that rep movs and rep cmps have two memory operands, unlike normal instructions. They're implicit in the machine code, but objdump makes them explicit.) This will also miss implicit memory operands like the stack for push / pop and call / ret.
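If the implicit stack accesses ever need to be reported as well, a mnemonic check along the following lines might be a starting point. It is a sketch only: `has_implicit_stack_access` is a hypothetical helper, it covers just the stack instructions named above plus `leave`, `enter`, `pushf` and `popf`, and it accepts an optional AT&T operand-size suffix (`pushq`, `retq`, ...). String instructions are deliberately left out because objdump already prints their operands explicitly.

```python
import re

# Mnemonics that read or write the stack without an explicit memory operand,
# optionally followed by an AT&T operand-size suffix (w, l or q).
_IMPLICIT_STACK = re.compile(r"(push|pop|call|ret|leave|enter|pushf|popf)[wlq]?")


def has_implicit_stack_access(mnemonic):
    """Return True if the mnemonic implicitly accesses the stack."""
    return _IMPLICIT_STACK.fullmatch(mnemonic.lower()) is not None


for name in ("push", "callq", "mov", "jmp"):
    print(name, has_implicit_stack_access(name))
```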

Sep Roland
Peter Cordes
  • The part about filtering `lea` is a good idea. But what about "long nop"? I think you should specify an example, because filtering them out is critical for correctness of this answer. – anatolyg Oct 22 '17 at 08:53
  • @anatolyg: Good idea. objdump uses the `nop` mnemonic for long-NOP instructions, so it should be easy. Am I missing anything else? – Peter Cordes Oct 22 '17 at 09:07

A given C statement is compiled into several machine instructions, and several of them may access memory. Think of something like ptr->fld = arr[i++] * arr[j]--; .... BTW, in some cases, arr[j] might have been used earlier, could already sit in some register, so might not need another memory load (but only a store, which could be deferred later).

I want to know the location, in executable, of the machine instruction that accesses (heap, global or local) memory generated by the given code

So your question might not make sense in general. Several machine instructions (or none of them) related to a single C statement in your source code might access memory. And register allocation and register spilling may happen, so a given machine instruction might be related to a C variable quite far from the "current" C statement (a notion which then makes little sense).

An optimizing compiler is allowed to mix several C statements and might output intermixed machine code. Read also about sequence points. There is no obvious mapping between machine code instructions and C statements (notably with optimizations enabled), which is why you often debug with fewer optimizations enabled (so gcc -g is best used with -O0 or -Og, not more).
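For unoptimized output like the listing in the question, the line table can still be turned into a per-line list of instruction addresses. The sketch below assumes that each line-table row (as printed by dwarfdump -l) starts a range that runs until the next row's address, and that the instruction addresses come from a disassembly; `addresses_by_line` and its parameters are hypothetical names. With optimizations enabled, one source line may own several disjoint ranges, so the grouping becomes much less meaningful.

```python
from bisect import bisect_right
from collections import defaultdict


def addresses_by_line(line_rows, insn_addresses):
    """Group instruction addresses by the source line whose row covers them.

    line_rows: iterable of (start_address, line_number) pairs.
    insn_addresses: iterable of instruction start addresses.
    """
    rows = sorted(line_rows)
    starts = [start for start, _ in rows]
    grouped = defaultdict(list)
    for address in sorted(insn_addresses):
        index = bisect_right(starts, address) - 1  # last row starting at or before address
        if index >= 0:
            grouped[rows[index][1]].append(address)
    return dict(grouped)


# Rows and addresses copied from the question's dwarfdump and objdump output.
rows = [(0x4004D6, 3), (0x4004DE, 4), (0x4004EC, 5), (0x4004F8, 5)]
insns = [0x4004D6, 0x4004D7, 0x4004DA, 0x4004DE, 0x4004E3,
         0x4004E8, 0x4004EC, 0x4004F2, 0x4004F6, 0x4004F8]
for line, addresses in sorted(addresses_by_line(rows, insns).items()):
    print(line, [hex(a) for a in addresses])
```

Note that this sketch attributes everything after the last row to that row's line, so a real tool would also need the end-of-sequence address from the line table.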

With GCC compile your src.c source file using

gcc -O -S -Wall -fverbose-asm src.c

and you'll get a slightly more readable src.s assembler file. You could use some editor or pager to look into that generated file.

Does anyone know how to get only instructions that access memory?

That does not make much sense. An optimizing compiler would sometimes share some common machine code related to several different C statements.

BTW, you might also ask GCC to dump various internal representations, for example using gcc -O -fdump-tree-all ; then you get hundreds of (textual) internal dump files (partially dumping various internal representations). Remember that GCC has hundreds of optimization passes.

Notice you might be more interested in working on GCC internal representations (e.g. GENERIC or GIMPLE or even RTL) by adding your own GCC plugin (or GCC MELT extensions). That could require months of work (notably to understand details of GCC internal architecture and representations).

Without understanding your high-level goals and motivations, we cannot help you more.

You should read much more about semantics and about undefined behavior, which is (indirectly) more relevant to your question than what you believe.

Notice that C statements do not correspond (one to many) to machine instructions. An optimizing compiler doesn't compile C statements one by one; it compiles an entire translation unit at once (and may for example do inline expansions, loop unrolling, stack unwinding, constant folding, register allocation and spilling, interprocedural optimizations and dead code elimination). This is why C compilers are such complex beasts of many millions of source code lines. BTW, most C compilers (e.g. GCC or Clang) are free software, so you can spend several months or years studying their source code.

Read also some good book on compilers (e.g. the latest Dragon Book), some books on semantics, and on programming languages pragmatics.

If you are interested in GCC internals specifically, my documentation page (also available here) of GCC MELT contains lots of slides and references.

If you only care about machine instructions, you might entirely forget about C and work, with the help of some disassembler library like libopcodes (see this), only on machine code in object files.

Look also into other static source code analyzers, including Coccinelle & Frama-C and libclang.

If you are interested only in GCC-emitted code and can afford recompiling your C source code, you might instead work inside the GCC compiler (through your GCC plugin or GCC MELT extension) at the GIMPLE level and detect (and perhaps transform) those GIMPLE instructions accessing memory. Detecting (and perhaps transforming) GIMPLE statements modifying memory could be simpler and might be enough.

I want to test the system by inserting a breakpoint and comparing memory before and after the breakpoint.

This is a bit similar to e.g. address sanitizers and other instrumentation features of GCC. You could spend several years working on something similar (and transforming some GIMPLE), then you probably want to add several additional passes in GCC (and you might need some extra runtime support).

Notice however that recent GDB is scriptable (in Guile or Python) and has watchpoints. If you just want to debug one particular program, that might be enough (and you might not need to dive into compiler internals, which would take many months or years of work). You should also use valgrind and address sanitizers.
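As a rough illustration of the scripting route, the following Python sketch compares a few bytes of memory before and after single-stepping one instruction. It assumes it is loaded inside GDB (where the `gdb` module is available) while the program is stopped at a breakpoint; `step_and_diff` and its `debugger` parameter are hypothetical names, and the parameter exists only so the helper can be exercised outside GDB with a stand-in object.

```python
try:
    import gdb  # only importable inside GDB's embedded Python
except ImportError:
    gdb = None  # lets the helper be imported and tested outside GDB


def step_and_diff(address, length, debugger=None):
    """Single-step one instruction and report whether memory changed.

    Reads `length` bytes at `address` before and after a `stepi`.
    Returns (changed, before, after).
    """
    dbg = debugger if debugger is not None else gdb
    inferior = dbg.selected_inferior()
    before = bytes(inferior.read_memory(address, length))
    dbg.execute("stepi")  # execute exactly one machine instruction
    after = bytes(inferior.read_memory(address, length))
    return before != after, before, after
```

A possible session for the listing in the question, assuming the helper was saved to a file and loaded with `source`: set `break *0x4004f6`, `run`, and then evaluate `python print(step_and_diff(int(gdb.parse_and_eval("$rax")), 4))`, since at that address the store goes through the pointer held in `%rax`.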

Basile Starynkevitch
  • I know little about the optimization. I know when register allocation happens, a compiler might decide not using memory. I also know that there is the unlimited number of machine instructions. For example `a=b; c=d; d=e; ...` in just one line. That's okay. I just want to know all instructions that access memory. It doesn't matter how many they are. – Dae R. Jeong Oct 22 '17 at 08:01
  • 1
    But **why** do these instruction accessing memory matters to you? What about `PREFETCH` machine instructions? Do they count as "memory access"? And the compiler sometimes optimizes by emitting some machine instruction used for several different C statements. So without additional explanation, your question has no sense. And register spilling instructions are not corresponding to some precise C statement! – Basile Starynkevitch Oct 22 '17 at 08:02
  • The question only asks about *instructions*. All the fluff about being associated with C statements is kind of beside the point (although it's definitely a sign of a possible X-Y problem, and the answer to this question might not actually be what the OP needs, so upvoted for taking the time to write that up.) – Peter Cordes Oct 22 '17 at 08:05
  • Please let me think about `PREFETCH`. But as you mentioned, sharing machine instructions among multiple C statements is the problem. How are they (shared machine instructions ) represented in dwarf format? Will GCC map all the location of C statements to the machine instruction? – Dae R. Jeong Oct 22 '17 at 08:10
  • Please **edit your question** by adding a few paragraphs explaining your overall goals and motivations. Without them, you won't get any valuable help. We can only guess (and often wrongly) what you want to do. Or you are very confused and naive about the role of a compiler and the [semantics](https://en.wikipedia.org/wiki/Semantics_(computer_science)) of programs. – Basile Starynkevitch Oct 22 '17 at 08:11
  • Again, your question needs additional motivation and context. – Basile Starynkevitch Oct 22 '17 at 08:40