We have an NDIS LWF driver, and on a very small number of systems we get an IRQL_UNEXPECTED_VALUE BSOD in NdisFIndicateReceiveNetBufferLists. We do not raise or lower the IRQL anywhere in our code, and NdisFIndicateReceiveNetBufferLists is called from the IRP_MJ_DEVICE_CONTROL callback. We also check the IRQL: if it is DISPATCH_LEVEL, we set the last argument to NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL, and to 0 otherwise. Could this be the issue?

I also found this article:

https://knowledge.broadcom.com/external/article/164146/crash-with-bug-check-0xc8-after-installi.html

They had a similar issue, and the cause seems to be that another NDIS driver raised the IRQL to DISPATCH_LEVEL and forgot to lower it. I'm still not sure whether this applies to our case. Could this also be our issue?

IRQL_UNEXPECTED_VALUE (c8)
The processor's IRQL is not what it should be at this time.  This is
usually caused by a lower level routine changing IRQL for some period
and not restoring IRQL at the end of that period (eg acquires spinlock
but doesn't release it).
Arguments:
Arg1: 0000000000020002, (Current IRQL << 16) | (Expected IRQL << 8) | UniqueValue
Arg2: fffff82621a444f0, Depends on UniqueValue:
    If UniqueValue is 0 or 1: APC->KernelRoutine.
    If UniqueValue is 2: the callout routine
    If UniqueValue is 3: the interrupt's ServiceRoutine
    If UniqueValue is 0xfe: 1 iff APCs are disabled
Arg3: ffff950cf4dccff0, Depends on UniqueValue:
    If UniqueValue is 0 or 1: APC
    If UniqueValue is 2: the callout's parameter
    If UniqueValue is 3: KINTERRUPT
Arg4: 0000000000000000, Depends on UniqueValue:
    If UniqueValue is 0 or 1: APC->NormalRoutine

Call stack:

nt!KeBugCheckEx
nt!KeExpandKernelStackAndCalloutInternal
nt!KeExpandKernelStackAndCalloutEx
ndis!ndisInvokeNextReceiveHandler
ndis!ndisFilterIndicateReceiveNetBufferLists
ndis!NdisFIndicateReceiveNetBufferLists
OurNdis

The second parameter, which is the callout routine (based on the unique value), is ndis!ndisDataPathExpandStackCallback.

Edit1:

I did a little more digging, and it seems that ndisDataPathExpandStackCallback just calls ndisCallReceiveHandler (which doesn't appear on the stack). I assume this simply indicates the received NBL to the other NDIS drivers. In any case, ndisDataPathExpandStackCallback is called via KeExpandKernelStackAndCalloutInternal, which saves the IRQL before the call and checks it afterwards; if the two values differ, it raises this bugcheck. Bingo!

But now my question is: how can I find the faulty driver? Can I use the ndiskd extension to determine which NDIS driver was invoked by the KeExpandKernelStackAndCalloutInternal call, so I can find the faulty driver and prove it?

While investigating the stack, I did find pacer!PcFilterReceiveNetBufferLists, but I doubt it is the faulty driver, considering it's a Windows driver, right?

OneAndOnly

1 Answer

> They had a similar issue, and the cause seems to be that another NDIS driver raised the IRQL to DISPATCH_LEVEL and forgot to lower it. I'm still not sure whether this applies to our case. Could this also be our issue?

That particular bugcheck means that someone leaked the IRQL during the code that has already unwound off the stack. KeExpandKernelStackAndCalloutInternal is doing something like this:

oldIrql = KeGetCurrentIrql();
(*callback)(...);
newIrql = KeGetCurrentIrql();

if (oldIrql != newIrql) {
    KeBugCheckEx(IRQL_UNEXPECTED_VALUE, (newIrql << 16) | (oldIrql << 8) | 2, ...);
}

Decoding the first argument, that means the IRQL was PASSIVE_LEVEL on entry, and DISPATCH_LEVEL on exit.
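
As a minimal sketch of that decoding, the following C helper unpacks Arg1 using the layout printed in the bugcheck description, (Current IRQL << 16) | (Expected IRQL << 8) | UniqueValue. The struct and function names are hypothetical and exist only for illustration; the only Windows facts assumed are that PASSIVE_LEVEL is 0 and DISPATCH_LEVEL is 2.

```c
#include <stdint.h>

/* Fields packed into Arg1 of bugcheck 0xC8 (hypothetical struct name). */
typedef struct {
    uint8_t current_irql;   /* IRQL observed after the callout returned */
    uint8_t expected_irql;  /* IRQL saved before the callout was made   */
    uint8_t unique_value;   /* 2 means "callout routine" for this check */
} C8Arg1;

/* Split Arg1 into its three byte-sized fields. */
static C8Arg1 decode_c8_arg1(uint64_t arg1)
{
    C8Arg1 d;
    d.current_irql  = (uint8_t)((arg1 >> 16) & 0xFF);
    d.expected_irql = (uint8_t)((arg1 >> 8) & 0xFF);
    d.unique_value  = (uint8_t)(arg1 & 0xFF);
    return d;
}

/* Name the IRQLs that matter in this thread; other levels are not decoded. */
static const char *irql_name(uint8_t irql)
{
    switch (irql) {
    case 0:  return "PASSIVE_LEVEL";
    case 1:  return "APC_LEVEL";
    case 2:  return "DISPATCH_LEVEL";
    default: return "higher IRQL";
    }
}
```

For the dump in the question, decode_c8_arg1(0x20002) gives a current IRQL of 2 (DISPATCH_LEVEL), an expected IRQL of 0 (PASSIVE_LEVEL), and a unique value of 2, which selects the callout interpretation of Arg2 and Arg3.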

Unfortunately, the code that did this has already finished running -- this bugcheck is just identifying that they didn't clean up the place before they left the room. You can make an educated guess as to what code was likely running by looking at the filter driver stack in !ndiskd.miniport. But this only gives you a starting place: depending on what packets were coming in from the network, the network stack could have called out into a variety of drivers. E.g., if the network indicated up an SMB3 packet, then execution actually winds its way all the way up through the filesystem stack. So it's not particularly easy to list out all the possible drivers that could have run.

One thing to check, though, is that you are using the NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL flag correctly. You are only allowed to set the flag if you are certain that the IRQL is currently DISPATCH_LEVEL. If that flag is used incorrectly, you might be able to trick some other driver into mismatching the IRQL. For example, a hypothetical driver might have:

void FilterReceiveNbls(..., ULONG ReceiveFlags) {
    KIRQL oldIrql;
    KeRaiseIrql(DISPATCH_LEVEL, &oldIrql);

    // . . . do stuff at dispatch level . . .

    // Bug: trusts the caller's flag instead of always restoring the saved IRQL.
    if (0 == (ReceiveFlags & NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL)) {
        KeLowerIrql(oldIrql);
    }
}

I'm not saying with certainty that's exactly what happened. I'm just looking for things you can audit in your driver, and correct use of NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL is one of them. Note that it's always correct to not add this flag to ReceiveFlags. (In fact, it's even correct to just clear the flag if you see someone else set it -- the flag's only benefit is a very tiny perf optimization.) So if you're ever in doubt, just don't add the flag.
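
To make the failure mode concrete, here is a small user-mode simulation in C, not kernel code. It models the current IRQL as a plain variable and replays the hypothetical buggy filter above. Every sim_ and SIM_ name is invented for this sketch, and the flag value is only a stand-in for NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL.

```c
#include <stdint.h>

enum { SIM_PASSIVE_LEVEL = 0, SIM_DISPATCH_LEVEL = 2 };

/* Stand-in for NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL (illustrative value). */
#define SIM_RECEIVE_FLAGS_DISPATCH_LEVEL 0x1u

/* Simulated "current processor IRQL". */
static uint8_t g_sim_irql = SIM_PASSIVE_LEVEL;

static void sim_raise_irql(uint8_t new_irql, uint8_t *old_irql)
{
    *old_irql = g_sim_irql;
    g_sim_irql = new_irql;
}

static void sim_lower_irql(uint8_t new_irql)
{
    g_sim_irql = new_irql;
}

/* The hypothetical buggy filter: it skips the lower when the flag is set,
   assuming the caller was already at dispatch level. */
static void sim_buggy_filter_receive(uint32_t receive_flags)
{
    uint8_t old_irql;
    sim_raise_irql(SIM_DISPATCH_LEVEL, &old_irql);
    if ((receive_flags & SIM_RECEIVE_FLAGS_DISPATCH_LEVEL) == 0) {
        sim_lower_irql(old_irql);
    }
}

/* Mimics the before/after comparison: returns 1 if the IRQL after the
   callout differs from the IRQL the caller started at, 0 otherwise. */
static int sim_indication_leaks_irql(uint8_t caller_irql, uint32_t receive_flags)
{
    g_sim_irql = caller_irql;
    sim_buggy_filter_receive(receive_flags);
    return g_sim_irql != caller_irql;
}
```

In this model, only the combination of a PASSIVE_LEVEL caller and the flag being set leaves the IRQL raised. Passing 0 for the flags never leaks, whatever the caller's IRQL, which matches the advice to leave the flag out when in doubt.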

Windows 11 can strictly verify this flag if you enable Driver Verifier (DV) with the NDIS/WIFI option enabled. The easiest thing to do is to enable DV on all drivers, but if that runs too slow, you can just select each individual network driver. On Windows 11, when DV is enabled with the NDIS/WIFI option, if any driver misuses any NDIS_XXX_DISPATCH_LEVEL flag, you'll get an instant bugcheck at the site of the error.

(DV does not currently verify that the driver returns the IRQL to its original level -- that's a good idea for the future, though.)

Jeffrey Tippet
  • We set NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL based on KeGetCurrentIrql (even though this is called in the IRP_MJ_DEVICE_CONTROL callback, which I assume always runs at PASSIVE_LEVEL, right?), and if it is DISPATCH_LEVEL, we set that flag. Is this a correct way of doing it? Also, are you saying that we can simply not use this flag even when the IRQL is DISPATCH_LEVEL, and nothing will go wrong? – OneAndOnly Feb 21 '22 at 11:55
  • And DV's force IRQL checking seems to be useless in this case, since it only tracks raises. So what should I do? How can I find the faulty driver? Also, the Windows version where this is happening is unfortunately 10, not 11. – OneAndOnly Feb 21 '22 at 12:06
  • Also note that we are the ones creating this packet, so we do not use ReceiveFlags and instead check the current IRQL to set the flag. The packet is sent from user mode and converted to an NBL in the driver's device control callback. – OneAndOnly Feb 22 '22 at 07:05
  • Another shot in the dark: has there been any IRQL-related bug in wanarp in Win10 build 18362 MP? WanNdisReceivePacketsCalloutRoutine is the last meaningful function address on the stack (not the call stack), and inside WanNdisReceivePackets there are a lot of KeAcquireSpinLockRaiseToDpc calls. Specifically, I see a KeAcquireSpinLockRaiseToDpc(&wanarp_g_rlConnTableLock) that, under some condition, gets released with KeReleaseSpinLockFromDpcLevel instead of KeReleaseSpinLock. – OneAndOnly Feb 22 '22 at 13:02
  • Thanks. We didn't have any known issues in WanNdisReceivePackets, but I looked at that function and I believe you're correct: there is a codepath through which the function can imbalance the IRQL. This would only happen if the IRQL starts out at PASSIVE_LEVEL, which is probably how it's gone unnoticed for so long. I've filed internal issue task.ms/38272328 to track this: if you have a Microsoft support contract, your rep can get you status updates on that issue. Meanwhile, try wrapping your receive handler in `KeRaiseIrql(DISPATCH_LEVEL)`/`KeLowerIrql` to work around the issue. – Jeffrey Tippet Feb 23 '22 at 06:48
  • Thanks Jeffrey, I'll try that and will report back the results. – OneAndOnly Feb 23 '22 at 06:57
  • Hi Jeffrey, just an update: it seems the issue has been resolved for the customer after wrapping the indication in KeRaiseIrql/KeLowerIrql and passing the NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL flag. Also, how can we follow the progress of that internal issue if we don't have a support contract? Another note: I found that this BSOD only happened after the customer connected to a VPN server using the built-in Microsoft VPN connection (network & sharing -> connect to workplace..). I wanted to share this in case it helps. – OneAndOnly Mar 01 '22 at 06:30
  • I'm glad to hear the customer is unblocked here. Unfortunately, our internal issue tracker is not public. I can share that the issue has been assigned to the VPN team and I've provided them with enough information to make a targeted fix. – Jeffrey Tippet Mar 03 '22 at 15:03
  • A fix for this issue is now available in the latest Windows preview, build numbers 22598 and later. – Jeffrey Tippet Apr 29 '22 at 19:53
  • Hi Jeffrey, could this approach of raising the IRQL and lowering it after the indication cause a DPC_WATCHDOG_VIOLATION (133/1) BSOD? I just received a dump, and it seems that raising and lowering around the indication causes it under heavy load. Is there any way to work around that? We can't remove the raise and lower, as that would bring back the BSOD discussed in this thread. – OneAndOnly Nov 23 '22 at 05:33
  • It seems to happen after our indication, when wanarp!WanNdisReceivePacket is using KeAcquireSpinLockRaiseToDpc (I can also see raspptp!CallIndicateReceived + ... + wanarp!WanNdisReceivePacket + KeAcquireSpinLockAtDpcLevel on the stack of another core). Should I post another thread about it? – OneAndOnly Nov 23 '22 at 05:41
  • Sorry for spamming here. I opened a thread about this issue, and I would be really grateful if you could find the time to take a look at it: https://stackoverflow.com/questions/74580135/dpc-watchdog-violation-133-1-potentially-related-to-ndisfindicatereceivenetbuf – OneAndOnly Nov 26 '22 at 06:23