The SystemTap script below is meant to attach strace whenever a process whose filename matches a given pattern is started.

It is called with the following command:

stap  -g -v './sstrace.stp' "$PATTERN"

Here PATTERN can be, for example, mount.

#!/usr/bin/env stap

# Assign command line parameter to the variable.
@define target_filename %( @1 %)        # The regex the script will trigger on given as CLI parameter

probe begin {
  printf( "Probe starting ...\n" )
  printf( "Try to attach strace upon executing binary (regex) /%s/\n\n" , @target_filename )
}

probe end {
  printf( "Wrapping up ...\n" )
}

probe syscall.execve {
  if ( filename =~ @target_filename ) {
    start_trace( pid() )
  }
}

###
### FUNCTIONS
###

function start_trace( pid ) {
  raise( 19 )   # 19 is SIGSTOP on x86 and ARM Linux
  # Sleeping is bad practice in a SystemTap probe, but I don't know how else to
  # wait for strace to initialize in time. This will not work as expected when
  # working interactively. Compare these results while increasing the sleep below
  # to 1 second.
  # $ sudo ./go date
  # $ date; echo hi
  # $ bash -c 'date; echo hi'
  system( sprintf( "strace -f -p %i & sleep 0.01; kill -CONT %i" , pid , pid ) )
}

The idea is to stop the target process (raise( 19 ), i.e. SIGSTOP) long enough for strace to attach, and then resume it (kill -CONT $TARGET_PID). This often works, but it fails intermittently.
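One possible way to avoid the fixed sleep is to poll the TracerPid field of /proc/PID/status, which is non-zero while a tracer is attached, and only send SIGCONT once it changes. The following is a minimal sketch under assumptions: the helper names tracer_pid and wait_for_tracer and the PROC_ROOT override are hypothetical (not part of the script above), fractional sleep requires GNU sleep, and whether this removes the intermittent failures is untested.

```shell
#!/bin/sh
# Hypothetical helper: print the TracerPid field for a PID.
# PROC_ROOT is an assumed override that exists only to make testing easy.
tracer_pid() {
    awk '/^TracerPid:/ { print $2 }' "${PROC_ROOT:-/proc}/$1/status" 2>/dev/null
}

# Hypothetical helper: poll until a tracer is attached to the PID.
# $1 = pid, $2 = number of attempts (default 100, about 1 second in total).
# Returns 0 once TracerPid is non-zero, 1 if the attempts run out.
wait_for_tracer() {
    pid=$1
    tries=${2:-100}
    while [ "$tries" -gt 0 ]; do
        tp=$(tracer_pid "$pid")
        if [ -n "$tp" ] && [ "$tp" -ne 0 ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 0.01
    done
    return 1
}
```

With these helpers saved in a file the system() call can source, the command string could become something like: strace -f -p PID & wait_for_tracer PID; kill -CONT PID. The process is resumed either way after the timeout, so a failed attach does not leave it stopped forever.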

When it works, strace kicks in right after execve(), similar to this:

$ strace date 2>&1 | head
execve("/bin/date", ["date"], 0x7ffeb7ce6430 /* 64 vars */) = 0 
brk(NULL)                               = 0x5578de8fa000   <== strace kicks in here.
access("/etc/ld.so.nohwcap", F_O............

The problem I really want to understand, and hopefully solve, is that on some systems I cannot kill -STOP the target process; it simply throws an error along the lines of: kill: process xyz does not exist.

I know by the time the execve syscall is called, the PID already exists. What I don't understand is why it doesn't seem to obey the SIGSTOP.

Does anyone know why this happens, how to fix the SystemTap script, or an even smarter way to start tracing a process on the fly?

jippie
  • `on some systems I cannot kill -STOP the target process` - do you mean that the issue is absent on some systems while is reproducible in stable manner on other ones? If yes - can you provide more details about system where you face the issue? Linux distribution, kernel version, stap version, etc. – Danila Kiver Mar 26 '19 at 16:03
  • Just for info: `CentOS 7.6.1810`, kernel `3.10.0`, stap `3.3/0.172` - everything seems to work as expected. – Danila Kiver Mar 26 '19 at 16:06
  • To help debug this, can you add a call to `cat /proc/%i/status` after the `kill` command? – Mark Plotnick Mar 26 '19 at 16:09
  • @MarkPlotnick Currently it works (of course) and it shows `state: T (stopped)`. Are you interested in specific fields? I cannot copy/paste from the VM at the moment. Let me give it a few shots to try to reproduce the problem. – jippie Mar 26 '19 at 17:55
  • Trying to reproduce with the `cat /proc/%i/status` in a while loop ... I'll update when I hit. Usually when it hits once, it stops working for a long time. – jippie Mar 26 '19 at 18:20
  • I'd like to see the state field of the status file, or an error message from cat, in cases where the kill command gets the does-not-exist error. – Mark Plotnick Mar 26 '19 at 19:14
  • I have had 2 VMs running non-stop for about a week trying to reproduce it, but it is apparently an intermittent thing :-( – jippie Apr 02 '19 at 18:32
