Optimizing Stack-Walking performance

Question

Currently I use the dbghelp library to walk the stack of a thread in some process (using GetThreadContext() and StackWalk64()) and collect only the return address that each frame contains.

However, the overhead of doing so is too high for the system's demands: overall it takes approx. 5 msec per stack walk (with 10-15 frames). This time includes the GetThreadContext() call and the loop that calls StackWalk64() to get all the frames.

I must find a way to do it much, much faster. Does anyone have an idea how I can do that?

Edit:

Does anyone know of the ETW (Event Tracing for Windows) mechanism?

If so, how can I trace all the context switches that happened in a certain period of time? Is there an event provider that publishes an event on each context switch?

It isn't exactly meant to be used in a performance-critical fashion. – GManNickG, Dec 06 '11 at 22:05
The `CONTEXT` structure filled by `GetThreadContext()` holds the register values. Since you didn't bother to specify the processor architecture, the answer would be: "Use this `CONTEXT` structure to walk the stack". For example, on `x86-32` `EBP` is the current frame pointer, the value at `EBP+0` is the previous frame pointer, and the value at `EBP+4` is the return address. – lapk, Dec 06 '11 at 23:55
GMan - It is used for profiling an RT system, not in debug mode but in its operational mode. Therefore it is critical to grab this info very fast, because the whole system is halted at that moment. – Hagay Myr, Dec 07 '11 at 00:07
AzzA - The target architectures are both X86 and IA64. Will it be faster than using StackWalk64? Doesn't StackWalk64 do exactly that? – Hagay Myr, Dec 07 '11 at 00:10
`StackWalk64()` does a lot more, and it does it in a portable way, independent of how the stack is laid out. So I'd expect it to be slower than reading two DWORD values through a pointer. However, I would STRONGLY recommend using `StackWalk64()`, since you are targeting different platforms, precisely because it's portable. – lapk, Dec 07 '11 at 00:33
The number of platforms is limited (two to three), so I think it won't be a problem to code it for all the possible architectures. Is it supposed to be harder/trickier to walk the stack manually on `IA64` than on `X86`? Anyhow, I will first measure how long a single `StackWalk64()` call actually takes and see whether it really is the bottleneck. Thanks AzzA. – Hagay Myr, Dec 07 '11 at 00:45
I was about to suggest that you time `StackWalk64()` itself. Are you sure your slow times do not come from a lot of thread context switching? In principle, if you need only the return address for each stack frame, it shouldn't be hard. On `x86-32` it's as simple as reading a `DWORD` value. – lapk, Dec 07 '11 at 00:52
@HagayMyr: You might have better luck using something intended to be used for profiling, to lower the overhead. (Also, use `@name` to reply.) – GManNickG, Dec 07 '11 at 03:15
@AzzA: I'll test it and return with some answers (hopefully). Please see the edit to my question... – Hagay Myr, Dec 07 '11 at 10:52
@GMan: Can you please elaborate a bit on your suggestion? Please see the edit to my question. Might it be what you meant? – Hagay Myr, Dec 07 '11 at 10:54
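
The x86-32 frame-pointer walk described in the comments above (the previous frame pointer at `EBP+0`, the return address at `EBP+4`) could look roughly like the sketch below. It is plain C that works on a copied snapshot of the stack rather than on a live thread, so `StackSnapshot`, `read_word` and `walk_ebp_chain` are hypothetical names, not dbghelp APIs. A real tool would have to fill the snapshot from the target process and take the starting `EBP` from the `CONTEXT` structure. The sketch also assumes the code was compiled with frame pointers; if they are omitted the chain is not reliable, and the layout is specific to x86-32.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of a 32-bit thread's stack memory.
   `base` is the target address of words[0]; `count` is the number of
   32-bit words that were copied. */
typedef struct {
    uint32_t base;
    const uint32_t *words;
    size_t count;
} StackSnapshot;

/* Read the 32-bit word stored at target address `addr`.
   Returns 1 on success, 0 if the address is misaligned or outside
   the snapshot. */
static int read_word(const StackSnapshot *s, uint32_t addr, uint32_t *out)
{
    if (addr < s->base || (addr - s->base) % 4 != 0)
        return 0;
    size_t idx = (addr - s->base) / 4;
    if (idx >= s->count)
        return 0;
    *out = s->words[idx];
    return 1;
}

/* Follow the EBP chain starting at `ebp` and store up to `max_frames`
   return addresses (innermost frame first) in `ret_addrs`.
   Returns the number of addresses stored. */
static size_t walk_ebp_chain(const StackSnapshot *s, uint32_t ebp,
                             uint32_t *ret_addrs, size_t max_frames)
{
    size_t n = 0;
    while (n < max_frames) {
        uint32_t prev_ebp, ret;
        /* [EBP+0] = caller's frame pointer, [EBP+4] = return address */
        if (!read_word(s, ebp, &prev_ebp) || !read_word(s, ebp + 4, &ret))
            break;
        ret_addrs[n++] = ret;
        /* The stack grows downward, so a caller's frame must sit at a
           higher address; anything else ends the chain (or is garbage). */
        if (prev_ebp <= ebp)
            break;
        ebp = prev_ebp;
    }
    return n;
}
```

Starting from the `EBP` value in the thread's `CONTEXT`, a call such as `walk_ebp_chain(&snap, ebp, out, 16)` fills `out` with at most 16 return addresses and stops at the first read that falls outside the snapshot. The bounds and ordering checks are there because a walker that trusts a corrupted chain can loop forever or read arbitrary memory.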

score 4 · Answer 1 · answered Dec 06 '11 at 22:53

The fastest way that I can think of is to create your own version of GetThreadContext and StackWalk64 by writing a kernel driver that grabs the kernelStack field of the ETHREAD structure of the thread you're trying to monitor. Here is a good article on this subject.

JosephH

I read the article you linked. I did entirely understand what you suggested. Does every user thread have an ETHREAD structure? My intention is to break into a running process and iterate over all its threads (whose IDs I don't know beforehand) to grab their stacks. How can I use/build a kernel driver to accomplish this task? – Hagay Myr Dec 07 '11 at 00:17
Yes, they all have an ETHREAD structure. You should start by getting the list of threads of the particular process. **However**, if you are not comfortable writing a kernel driver (IMHO accessing ETHREAD itself is pretty dangerous, since the structure differs between OS versions), I'd rather just stick with the user-mode code, because you're trading off between performance and stability. Otherwise, write as much ring-0 code as you can, because the more code you write in the kernel, the better performance you'll get. – JosephH Dec 07 '11 at 03:51
I have pretty much zero experience writing kernel drivers on Windows. If you say it might also be risky, then I'll take your advice and stick to the user-mode code. Please see the edit to my question, maybe you could help with that... – Hagay Myr Dec 07 '11 at 10:57

score 2 · Answer 2 · answered Jan 20 '12 at 08:09

If you're on Windows Vista or higher, you should use ETW, period. You can enable everything you're talking about, including Context Switch and Sample Profile events, and it's pretty efficient. For X86, it's basically walking the EBP register chain, which is a linked list of addresses that it needs to iterate over. In 64-bit land, the stack walker has to unwind the stack, so it's a little less efficient, but I can tell you that if you're doing any reasonable amount of work in your application, the effects of stack walking will not show up. It's certainly not in the millisecond range.

score 1 · Answer 3 · answered Dec 07 '11 at 18:31

The ETW part is actually an independent question. Windows Performance Analysis Tools can capture all context switches, as can the Visual Studio Profiler in "Resource Contention Concurrency Profiling" mode. You can also dump all events into a file manually using logman; see the instructions here.

Uri Cohen

Agreed, but it came up in the same context (my question), which is trying to find an efficient (and cheap) way to profile a program on an already deployed system (not in our hands and not in debug mode). I read the article you linked about ETW. I wonder how one can do this context-switch tracing programmatically. To be clear, I need to write a program (not in a .NET environment) that collects the call stacks of the switching threads at every context switch in a certain period of time (say, the last 3 minutes). Where should I start? – Hagay Myr Dec 07 '11 at 20:24
I would use the tools I gave you to collect the data, then process it using custom code. Collecting it 'yourself' is just not cost-effective. So use logman, Xperf or the VS profiler to create a trace file, then use code to parse it in batch (several formats are available). – Uri Cohen Dec 08 '11 at 11:15
