I don't know about Linux, so I'll answer for Windows. Some of the kernel space is 'global', which is a flag set in the PTE to indicate it is used by more than one process. The INVPCID
instruction can be configured in the register operand to include or exclude these entries in a TLB invalidate. These page table entries are shared between the processes and all appear at the same place in the page table for each process. This way, only the single PTE needs to be updated and it doesn't need to synchronise other PTEs of other processes as they all share a single PTE at a physical address.
http://www.cs.miami.edu/home/burt/journal/NT/memory.html
Some kernel memory is not visible to all processes and is private to each process (doesn't change the fact it is still ring 0). This, on a 32 bit Windows system would be 0xC0000000–0xC0200000 which contains all the user space PTEs and PDEs where 0xC0000000 is the PTE_BASE which allows for the equation
#define MiGetPteAddress (x) ((PMMPTE)(((((ULONG)(x)) >> 12) << 2) + (ULONG_PTR)MmPteBase))
#define MiAddressToPte(x) MiGetPteAddress(x)
to work elegantly for converting faulting virtual address in cr2
to the address of the PTE. This is private to each process as each process has the same base PTE allocation base address; if it were visible to all processes it would quickly take up virtual memory as each set of page takes would have to be allocated sequentially. It doesn't need to be visible to all processes because a process has no interest in the page table entries of another process. A page fault is always handled in the context of the current process, and 0xC0000000–0xC0200000 means something different in each process context.
The kernel space 0xC0200000–0xC0400000 for allocation of kernel PTEs (for kernel addresses) would however be global and shared by all processes, except for the section within it representing 0xC0000000–0xC0200000, which by my calculation will be 0xC0300000–0xC0300800, which is the user-mode side of the PDEs as PDE_BASE = 0xC0300000–0xC0300FFF.
It is however impossible to split up the user PDE and kernel PDE section such that the former is private and the latter is global (i.e. make 0xC0300000–0xC0300800 private (point to different physical addresses) and 0xC0300000–0xC0300FFF point to the same physical address for each process) because the whole PDE region (0xC0300000–0xC0300FFF) will lie on the same physical frame and constitutes a single frame pointed to by cr3
, and the cr3
is different for each process, which means that the whole PDE region (all PDEs) would have to private per process (duplicated and installed per process). If a kernel page table page (a page containing a kernel page table) were paged out and in to a new physical location then the PDEs would all have to be synchronised because all processes have copies at different cr3 physical addresses and not the same physical PDE. I'm not sure how it does this (efficiently) ATM therefore it would be wise to impose the restriction of not allowing the kernel page tables to be paged out and have them in non-paged pool; this way the kernel PDEs will remain constant across all CR3 pages. On 64 bit, the restriction could be imposed that kernel PDPTs can't be paged out. On 32 bit Windows, a process is started with a physical CR3 page with a PDE at offset 1100000000(base 2)*4 bytes pointing to itself which is hardwritten in, probably by briefly turning off paging in cr0 (because the write won't succeed without the recursive entry that needs to be written being there, creating a paradox). Notice, the PD Entry for itself is the page table that covers the range 0xC0000000–0xC0400000 i.e. it points to 1023 page tables and 1 page directory (itself) (2^10 entries) and hence allows the PTEs to be modified by their virtual address. The reason why the CR3 page is at 0xC0300000 is because the address has the same page directory and page table indexes 1100000000 and 1100000000 so it loops back on itself twice, therefore yielding the CR3 page and you can modify the PDEs by address (there are other addresses that are special like this e.g. 0xE0380000). After it is set up, the appropriate kernel mappings are made. On 64 bit Windows it would be similar where a process is set up with a single PML4 table page which points to itself and this way any PML4E, PDPTE, PDE or PTE can be filled in and accessed due to the variable amount of loopbacks. On 64 bit Windows, when a process is terminated, all the physical pages of the process get moved to the free list which would include all user physical PDPT pages, PD pages, PT pages and the PML4/CR3 page. The kernel ones would not be marked for the free list.
In general, if you know what entry in the PML4 is the recursive entry to the physical PML4 page you, can work out the virtual address of the PTE structure that services (is used to translate) a particular virtual address range and a particular virtual address in that range. You append the offset (10 bits for 32 bit; 9 bits for 64 bit) in the PML4 to the entry to itself, to the start of a virtual address whose servicing PTE virtual address you want to find (which is what the addition of 0xC0000000 is in the 32 bit equation earlier) and remove the last 12 bits and then make up the offset in the PT now at the end of the virtual address to 12 bits by multiplying it by 8 (or 4) (hence the right shift by 12 and the left shift by 3 (or 2 for 32 bit entries)). 1 loopback takes away 1 layer of indirection and you get the virtual address of the PTE. 2 loopbacks will leave you with the virtual address of the PDE that's used to translate that particular virtual address and so on. PTE_BASE on 32 bit windows is the offset 110000000 left shifted to make 32 bits and PDE_BASE is the offset 110000000110000000 left shifted to make 32 bits. It is used in the macro and any virtual address with this prefix will by definition be part of a PTE or a PDE respectively. Windows chooses the offset 1100000000 for the page table hierarchy but it could be any one of the 2^9 combinations.
KAISER, or KPTI, designed to mitigate meltdown, most likely has 2 cr3
s for each process. Upon trapping to the kernel, the restricted cr3 for user mode which would contain a single kernel PML4E—enough for a preliminary interrupt dispatch routine function to be accessible, which performs the swap—would be replaced with the full cr3 containing all kernel PML4Es.
As for physical memory on windows, see here: https://superuser.com/a/1549970/933117