I have a weird problem with some kernel code I have written. I can't share the exact code, but I can give the general idea of what's going on.

I'm working on a project (Windows) that modifies the page tables of a process in order to modify a function in memory, by changing the PFN in the PTE to point to another physical page with different contents. I am doing this in order to hook a function.
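
For readers unfamiliar with the layout, here is a minimal sketch of what "changing the PFN in the PTE" means at the bit level. It assumes a 4 KiB x86-64 page table entry, where bits 12 to 51 hold the page frame number. It is not the project's code: `PTE_PFN_SHIFT`, `PTE_PFN_MASK`, `PteGetPfn`, and `PteWithPfn` are hypothetical names, and the entry is treated as a plain 64-bit value rather than a live page table entry.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB x86-64 PTE, page frame number in bits 12..51. */
#define PTE_PFN_SHIFT 12
#define PTE_PFN_MASK  0x000FFFFFFFFFF000ULL

/* Extract the page frame number from a raw PTE value. */
static uint64_t PteGetPfn(uint64_t pte)
{
    return (pte & PTE_PFN_MASK) >> PTE_PFN_SHIFT;
}

/* Return a copy of pte that points at a different frame.
   All other bits (present, writable, NX, ...) are preserved. */
static uint64_t PteWithPfn(uint64_t pte, uint64_t pfn)
{
    assert(pfn <= (PTE_PFN_MASK >> PTE_PFN_SHIFT));
    return (pte & ~PTE_PFN_MASK) | (pfn << PTE_PFN_SHIFT);
}
```

Only the frame bits change, so the protection and caching attributes of the mapping stay as they were; the virtual address simply resolves to a different physical page.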

Once the hooked function is called, it does some processing that looks like this:

    void HookFunctionViaPTE()
    {
        // get a pointer to the PTE for the function "MyRoutine"
        PPTE pte = RetrievePTE(&MyRoutine);
        pte->PFN = g_HijackedCodePfn;

        // g_HijackedCodePfn is the PFN of an allocated page in memory containing
        // a copy of the page which "MyRoutine" lies in. Overwrite it with a jump
        // to "MyRoutine_Hook"
        memcpy(g_HijackedCodePtr + VirtualAddress.PFNOffset, hookCode, sizeof(hookCode));
    }

    void MyRoutine_Hook(
        PVOID context
    )
    {
        // some work here

        // call original version of this function
        // setup PTE to point to old physical page
        RestoreOriginalPFNInPTE();

        __writecr3(__readcr3());

        // this should call into the original code and not into this hook recursively
        MyRoutine();

        // go back to hacked context
        RestoreHackedPFNInPTE();

        __writecr3(__readcr3());

        // other work here
    }

Essentially, within the function hook I modify the page tables so the original data is pointed to again in RAM so when I call the function recursively it calls the original instead of going back into the hook again.

Slight problem though: everything works perfectly when stepping through each line with a debugger. However, when the code runs freely, it seems as if the CPU forgets that I have changed the page tables: when MyRoutine is called in the hook, it calls the hook again. I've tried pretty much everything to fix it, including invalidating the paging entries, flushing the entire TLB, and even recreating the paging structures in a separate physical page and then setting cr3 to that. But nothing really fixes the problem.

I had some success using __wbinvd, but the behavior was strange. I had to place it right before the function call to make any noticeable difference, but even then it didn't work.

While an exact solution may not be clear without the full source, can someone explain what conditions could cause the CPU to act like this, or what I can do to fix it?

Arush Agarampur
  • Do `g_HijackedCodePtr` and `RestoreHackedPFNInPTE()` and "another physical page with different contents" indicate that you are trying to abuse the system? And "an exact solution isn't clear from the lack of source" suggests that you are failing to use *someone else's* code to do it? – Weather Vane Jul 06 '23 at 20:58
  • @WeatherVane Technically abuse a specific process. I said lack of source because, as you can see, the code I posted is very basic and does not actually contain functional statements; it's still my code. – Arush Agarampur Jul 06 '23 at 21:04
  • I don't totally understand what you're doing, but the description of your design doesn't really seem like it can work reliably. PTE granularity is pages, but functions aren't organized neatly onto memory pages. – Barmar Jul 06 '23 at 21:21
  • @WeatherVane It seems like it's some kind of monkey patching mechanism. "hijacked" refers to the code that's being patched. – Barmar Jul 06 '23 at 21:22
  • @Barmar I have omitted code that rounds down and locates the function after each page boundary – Arush Agarampur Jul 06 '23 at 22:19
  • I haven't actually looked at the code, I don't know enough about MMU details to comment, I was just commenting on the general idea. If you think you've taken care of that problem, OK. It just smells funny to me. – Barmar Jul 06 '23 at 22:32
  • I'd assume Windows enables PCID (Process Context IDs) so TLB entries are tagged with an ID. Writing CR3 doesn't invalidate TLB entries, and writing it with the same CR3 (including the PCID in the low bits) allows it to hit on the existing TLB entries with the same PCID tag. Use `invlpg` to invalidate pages. – Peter Cordes Jul 07 '23 at 01:51
  • Do you invalidate caches after changing the paging? During debug the switching through execution-debug should perform cache invalidation transparently. – Frankie_C Jul 07 '23 at 07:19
  • Yes, it seems that was the problem. Adding `__invlpg()` worked @PeterCordes, but I still don't understand why you say writing CR3 doesn't invalidate TLB entries, as osdev wiki says "...the TLB has to be flushed upon such a change. On x86 systems, this can be done by writing to the page directory base register (CR3)." I thought that since I'm writing to CR3, an `invlpg` wouldn't be needed. – Arush Agarampur Jul 07 '23 at 07:51
  • Invalidating the full TLB can have a big cost on whole-system performance, so I don't think that writing CR3 invalidates the whole TLB; the strategy should depend on what the paging modifications are. See the considerations made for the Linux kernel at https://www.kernel.org/doc/html/latest/arch/x86/tlb.html – Frankie_C Jul 07 '23 at 09:06
  • See https://www.kernel.org/doc/Documentation/x86/pti.txt / [Does Linux use x86 CPU's PCID feature for TLB? If not, why?](https://stackoverflow.com/q/20155304) for more details about what the feature is. The OSDev Wiki entry you're quoting is assuming you aren't using PCIDs. If you were writing your own OS, you'd know if you'd enabled that bit in a control register, and would only have done so after reading about PCIDs. – Peter Cordes Jul 07 '23 at 12:39
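
The fix that came out of the comments (adding `__invlpg()` after each PTE change) can be sketched roughly as follows. This is a user-mode mock, not the asker's code: `MockPte`, `InvalidatePage`, `SwapPfn`, and `CallWithOriginalPage` are hypothetical names, and `InvalidatePage` only records the address at the point where a driver would call the `__invlpg` intrinsic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified PTE: only the PFN field is modeled. */
typedef struct {
    uint64_t pfn;
} MockPte;

/* Bookkeeping for the mock so the call order can be checked in user mode. */
static const void *g_lastInvalidated = NULL;
static int g_invalidateCount = 0;

/* Stand-in for __invlpg(va). Assumption: in a real driver this would be
   the intrinsic itself, which cannot run outside ring 0. */
static void InvalidatePage(const void *va)
{
    g_lastInvalidated = va;
    g_invalidateCount++;
}

/* Point the PTE at a new frame, then drop the cached translation for
   the page so the next access walks the updated page tables. */
static void SwapPfn(MockPte *pte, uint64_t newPfn, const void *va)
{
    assert(pte != NULL);
    pte->pfn = newPfn;
    InvalidatePage(va);
}

/* Pattern from the hook body: map the original page, call through,
   then re-arm the hook. Returns the PFN that was active during the call. */
static uint64_t CallWithOriginalPage(MockPte *pte, const void *va,
                                     uint64_t originalPfn, uint64_t hookedPfn)
{
    SwapPfn(pte, originalPfn, va);      /* original code is mapped again */
    uint64_t pfnDuringCall = pte->pfn;  /* a real hook would call MyRoutine() here */
    SwapPfn(pte, hookedPfn, va);        /* go back to the hooked page */
    return pfnDuringCall;
}
```

The point of the sketch is only the ordering: every PTE write is followed by an invalidation of that specific page before the page is executed again. Note that `invlpg` affects only the logical processor that executes it, so on a multi-core system other processors may still hold a stale translation; that case is outside this sketch.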

0 Answers