Right, this is extremely obscure...

So on Windows, when you hit control-C to interrupt a console program, this sends a CTRL_C_EVENT to the process. You can also do this manually via GenerateConsoleCtrlEvent. In Python, os.kill acts as a wrapper around the C-level GenerateConsoleCtrlEvent, and allows us to send a CTRL_C_EVENT to the current process by doing:

os.kill(os.getpid(), signal.CTRL_C_EVENT)

However, this doesn't just go to the current process – it actually goes to the whole "process group" that this process is a part of.

I have a test suite which calls os.kill like you see above, as part of some tests to make sure my program's control-C handling works correctly. When running this test suite on appveyor, though, this causes a problem, because apparently some of appveyor's infrastructure is in the same "process group" and gets broken.

The solution is that we need to spawn the test suite with the CREATE_NEW_PROCESS_GROUP flag set, so that its CTRL_C_EVENTs don't "leak" to the parent. That's easily done. BUT...

If I use CREATE_NEW_PROCESS_GROUP and run the child script using python whatever.py, then it works as expected: the CTRL_C_EVENT is confined to the child.

If I use CREATE_NEW_PROCESS_GROUP and run the child script using py whatever.py (i.e., using the python launcher, which is supposed to be equivalent to running python directly), then the CREATE_NEW_PROCESS_GROUP seems not to have any effect: the CTRL_C_EVENT affects the parent as well!

Here's a minimal sample program that just uses os.kill on itself and then checks that it worked (minor wrinkle: CREATE_NEW_PROCESS_GROUP sets CTRL_C_EVENT to be ignored in child processes, so there's a bit of fluff here using SetConsoleCtrlHandler to un-ignore it): https://github.com/njsmith/appveyor-ctrl-c-test/blob/master/a.py
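The "un-ignore" step mentioned above could look roughly like the following. This is a minimal sketch, not the code from the linked a.py: the helper name unignore_ctrl_c is made up for illustration, and it relies on the documented behaviour that calling SetConsoleCtrlHandler with a NULL handler and FALSE restores normal CTRL+C processing.

```python
import sys


def unignore_ctrl_c():
    """Restore normal CTRL+C processing in this process (hypothetical helper).

    Returns True if the Windows call was made, False on other platforms.
    """
    if sys.platform != "win32":
        return False  # CTRL_C_EVENT is a Windows console concept
    import ctypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    # A NULL handler with Add=FALSE clears the "ignore CTRL+C" state that
    # CREATE_NEW_PROCESS_GROUP sets for the new process.
    if not kernel32.SetConsoleCtrlHandler(None, False):
        raise ctypes.WinError(ctypes.get_last_error())
    return True
```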

Here's the wrapper script I use to run the above program: https://github.com/njsmith/appveyor-ctrl-c-test/blob/master/run-a.py

If the wrapper script runs python a.py, then everything works. If the wrapper script runs py a.py, then the wrapper script receives a KeyboardInterrupt.
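For reference, the wrapper does something along these lines. This is a simplified sketch rather than the actual run-a.py; run_in_new_group and NEW_GROUP are names invented here, and the getattr fallback is only there so the snippet also imports on non-Windows systems, where the flag does not exist.

```python
import subprocess
import sys

# CREATE_NEW_PROCESS_GROUP only exists on Windows; fall back to 0 (no flags)
# elsewhere so the sketch stays importable.
NEW_GROUP = getattr(subprocess, "CREATE_NEW_PROCESS_GROUP", 0)


def run_in_new_group(argv):
    """Run argv as the leader of a new process group; return its exit code."""
    proc = subprocess.Popen(argv, creationflags=NEW_GROUP)
    return proc.wait()


# Example (assumes a.py is in the current directory):
# run_in_new_group([sys.executable, "a.py"])   # the "python a.py" case
# run_in_new_group(["py", "a.py"])             # the "py a.py" case
```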

So my question is: what the heck is going on here? What is the py launcher doing differently from python that causes the CTRL_C_EVENT to "leak" into the parent process, even though it's in a different process group? And how is that even possible?

(I originally discovered this because running pytest a.py acts like py a.py, i.e. is broken, but python -m pytest a.py works, presumably because the pytest entry point uses the py launcher.)

Nathaniel J. Smith
  • According to the documentation (see the first link in your question) you can't specify a process group when sending a CTRL+C signal, i.e., `dwProcessGroupId` always has to be zero, which sends the signal to every process that share the console of the calling process. So it is the call to `python` that is behaving oddly, not the call to `py`. :-) – Harry Johnston Feb 11 '17 at 22:13
  • Huh, true. That's clearly not what actually happens, though. Reading it again though, I realized that what *might* explain this is if `dwProcessGroupId=0` means send `CTRL_C_EVENT` to all processes on the console, regardless of process group id, *and* if you set `dwProcessGroupId` to the PID of a process that *isn't* a group leader, it acts like setting it to 0. Then the special thing about `py` would be that we make `py` a group leader, but `py` spawns `python` to run the code, so `python` isn't a group leader. – Nathaniel J. Smith Feb 11 '17 at 23:14
  • @eryksun: my current guess is that the bug is that `GenerateConsoleCtrlEvent` treats an invalid group id as if it were 0 (meaning "all groups"). I agree that `os.kill`'s way of wrapping it is pretty confusing, but that's not the issue. – Nathaniel J. Smith Feb 14 '17 at 07:19
  • @eryksun: Oh interesting, thanks! I'll accept if you post that as an answer. …I've actually given up on this for now anyway because even when I do everything "right" (either properly passing a group leader or spawning a new console), then I'm still getting random hangs; I call `GenerateConsoleCtrlEvent` but then no signal arrives. Weirdly, this seems to go away if I increase debugging output, almost as if printing more stuff to the console makes events be delivered more reliably? I'm not even sure how to phrase this as a question, but throwing this out there in case it rings a bell for you… – Nathaniel J. Smith Feb 14 '17 at 08:03

1 Answer

Every process is in a process group. It either inherits its parent's group, or it gets created as the leader of a new group via the CREATE_NEW_PROCESS_GROUP creation flag. As far as I know, GenerateConsoleCtrlEvent is the only API that uses process groups, and there's no API to query the process group ID. You could grab it from the ProcessParameters in the PEB, but that involves using undocumented internal structures. No one should do that, so GenerateConsoleCtrlEvent should only send to either group 0 to broadcast an event or to a child process that you know was created as a new group.

The problem that you've uncovered here is that sending an event to a process that's attached to the console but not the leader of a group gets silently handled as if the target were group 0. Specifically in your case you're starting py.exe as the group leader and then trying to send CTRL_C_EVENT to python.exe, i.e. os.getpid(). You'd have to send to os.getppid() in this case.
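As a rough sketch of that choice: the helper below picks which PID to pass to os.kill. The function name and the launched_via_py flag are hypothetical (Python has no built-in way to tell that it was started by the launcher), and it assumes the direct parent is the py.exe that was created with CREATE_NEW_PROCESS_GROUP.

```python
import os


def ctrl_c_target_pid(launched_via_py):
    """Return the PID to treat as the process-group leader.

    launched_via_py is a caller-supplied assumption: True means the direct
    parent is py.exe, which is the group leader in that case.
    """
    return os.getppid() if launched_via_py else os.getpid()


# On Windows the event would then be sent with:
# os.kill(ctrl_c_target_pid(True), signal.CTRL_C_EVENT)
```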

This problem is common with Python scripts on Windows because of the confusing implementation of os.kill. It conflates process IDs and process group IDs. It would have been less confusing had GenerateConsoleCtrlEvent been used for os.killpg (currently not implemented on Windows) and TerminateProcess alone used for os.kill.

Experiencing random hangs when calling os.kill(os.getpid(), signal.CTRL_C_EVENT); time.sleep(10) may be due to a race condition. time.sleep is implemented by pysleep in Modules/timemodule.c. On Windows, when called in the main thread, it waits on an event that gets set by the signal handler for SIGINT (but not SIGBREAK for some reason). The possible race here is that pysleep resets the event before waiting on it. The signal handler executes on a new thread, and occasionally it may have already set the event before pysleep resets it. This is conceivable since executing CPython bytecode is relatively slow. That said, I'd expect it to be a close race because there are a lot of steps involved to create the control handler thread, as the following overview shows for Windows 10.

1. Console Client -- Main Thread (python.exe)
   kernelbase!GenerateConsoleCtrlEvent
   kernelbase!ConsoleCallServer
   ntdll!NtDeviceIoControlFile

2. Device Driver (condrv.sys)
   condrv!CdpDispatchDeviceControl

    3a. Console Server (conhost.exe)
        ntdll!NtDeviceIoControlFile
        conhostv2!SrvGenerateConsoleCtrlEvent
        conhostv2!ProcessCtrlEvents
        user32!ConsoleControl
        ntdll!CsrClientCallServer
        ntdll!NtRequestWaitReplyPort (ALPC)

    3b. Windows Server (csrss.exe)
        ntdll!NtAlpcSendWaitReceivePort
        winsrv!SrvEndTask
        winsrv!CreateCtrlThread
        ntdll!NtCreateThreadEx (Control Thread)

    3c. Console Client -- Control Thread (python.exe)
        kernelbase!CtrlRoutine
        ucrtbase!ctrlevent_capture (Emulate SIGINT)
        python36!signal_handler
        kernelbase!SetEvent (SIGINT Event -- Race with step 4)

4. Console Client -- Main Thread (python.exe)
   python36!pysleep
   kernelbase!ResetEvent (SIGINT Event -- Race with step 3c)
   kernelbase!WaitForSingleObjectEx (SIGINT Event)
   python36!PyErr_CheckSignals       
   python36!signal_default_int_handler

If the race condition is the problem, then giving time.sleep enough time to reset the event before GenerateConsoleCtrlEvent gets called should resolve it. Try calling os.kill using a threading.Timer with a small delay.
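A minimal sketch of that suggestion follows. The helper names are invented for illustration, the kill parameter exists only so the sender can be swapped out when experimenting, and the fallback to signal.SIGINT is an assumption added so the snippet also runs on non-Windows systems, where CTRL_C_EVENT is not defined.

```python
import os
import signal
import threading
import time

# CTRL_C_EVENT exists only on Windows; SIGINT is a stand-in elsewhere.
CTRL_C = getattr(signal, "CTRL_C_EVENT", signal.SIGINT)


def schedule_ctrl_c(delay=0.1, pid=None, kill=os.kill):
    """Start a timer that calls kill(pid, CTRL_C) after `delay` seconds."""
    target = os.getpid() if pid is None else pid
    timer = threading.Timer(delay, kill, args=(target, CTRL_C))
    timer.daemon = True
    timer.start()
    return timer


def sleep_expecting_ctrl_c(seconds=10, delay=0.1):
    """Sleep in the calling thread; return True if the sleep was interrupted."""
    schedule_ctrl_c(delay)
    try:
        # The delay gives time.sleep a chance to reset the SIGINT event
        # before the control event is generated.
        time.sleep(seconds)
    except KeyboardInterrupt:
        return True
    return False
```

sleep_expecting_ctrl_c is meant to be called from the main thread. If it still occasionally returns False with a generous delay, the hang is probably not caused by this particular race.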

Eryk Sun
  • Thanks, this is a great answer! And good point about the race condition – I'm not sure if that's what I was hitting, but I [filed a bug](https://bugs.python.org/issue30151) anyway :-). I can imagine that I could have been if it was triggering some heuristics in the scheduler or something... I'm able to reliably hit [this race condition](https://bugs.python.org/issue30038) b/c apparently writing to a socket from the signal handler that another thread is `select`ing on deterministically switches control to that other thread... – Nathaniel J. Smith Apr 24 '17 at 07:18