
I am looking for a way to implement a function that takes an address and tells the page size used at that address. One solution looks the address up in the segments in /proc/<pid>/smaps and returns the value of the "KernelPageSize:" field. This solution is very slow because it involves reading a file linearly, and that file might be long. I need a faster, more efficient solution.

Is there a system call for this (something like `int getpagesizefromaddr(void *addr);`)? If not, is there a way to deduce the page size?

Eldad
  • `/proc/PID/smaps` is not a file, it's a pseudo-file provided by the kernel *when read*. (The `open()`/`read()`/`close()` syscalls cause the kernel to generate the data then and there as needed; it does not go through any filesystem stuff. In fact, it's a surprisingly lightweight mechanism.) It is not "very slow", especially if you use `unistd.h` I/O and a robust parser function. `stdio.h` is full-featured but slow, so avoid it. – Nominal Animal Jan 16 '14 at 11:30
  • You are correct, but I didn't mean that the slow solution is slow due to file I/O, but due to string processing. In a test I did, an smaps "file" can have more than 500K lines, so the worst-case run time is high (bad). – Eldad Jan 16 '14 at 12:44
  • Do different addresses have different page sizes? If not, maybe this is what you need: `sysconf(_SC_PAGESIZE)` – Lee Duhem Jan 16 '14 at 13:52
  • @leeduhem: Yes, they do. On x86-64, for example, normal page size is 4 KB (4096 bytes), but huge pages are 2048 KB (2097152 bytes). – Nominal Animal Jan 17 '14 at 03:05
  • @NominalAnimal Well, in this case, `sysconf()` can not give the information you need. – Lee Duhem Jan 17 '14 at 05:46

1 Answer


Many Linux architectures support "huge pages", see Documentation/vm/hugetlbpage.txt for detailed information. On x86-64, for example, sysconf(_SC_PAGESIZE) reports 4096 as page size, but 2097152-byte huge pages are also available. From the application's perspective, this rarely matters; the kernel is perfectly capable of converting from one page type to another as needed, without the userspace application having to worry about it.

However, for specific workloads the performance benefits are significant. This is why transparent huge page support (see Documentation/vm/transhuge.txt) was developed. This is especially noticeable in virtual environments, i.e. where the workload is running in a guest environment. The new advice flags MADV_HUGEPAGE and MADV_NOHUGEPAGE for madvise() allow an application to tell the kernel about its preferences, so that mmap(...MAP_HUGETLB...) is not the only way to obtain these performance benefits.

I personally assumed Eldad's question was related to a workload running in a guest environment, and that the point is to observe the page mapping types (normal or huge page) while benchmarking, to find out the most effective configurations for specific workloads.

Let's dispel all misconceptions by showing a real-world example, huge.c:

#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>

#define  PAGES 1024

int main(void)
{
    FILE   *in;
    void   *ptr;
    size_t  page;

    page = (size_t)sysconf(_SC_PAGESIZE);

    /* Reserve PAGES pages of private anonymous memory, backed by huge
       pages if MAP_HUGETLB can be satisfied from the hugepage pool. */
    ptr = mmap(NULL, PAGES * page, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, (off_t)0);
    if (ptr == MAP_FAILED) {
        fprintf(stderr, "Cannot map %ld pages (%ld bytes): %s.\n", (long)PAGES, (long)PAGES * page, strerror(errno));
        return 1;
    }

    /* Dump /proc/self/smaps to standard out. */
    in = fopen("/proc/self/smaps", "rb");
    if (!in) {
        fprintf(stderr, "Cannot open /proc/self/smaps: %s.\n", strerror(errno));
        return 1;
    }
    while (1) {
        char *line, buffer[1024];

        line = fgets(buffer, sizeof buffer, in);
        if (!line)
            break;

        /* Keep map header lines (they begin with a hex address) and the
           lines mentioning page sizes or huge pages; skip everything else. */
        if ((line[0] >= '0' && line[0] <= '9') ||
            (line[0] >= 'a' && line[0] <= 'f') ||
            (strstr(line, "Page")) ||
            (strstr(line, "Size")) ||
            (strstr(line, "Huge"))) {
            fputs(line, stdout);
            continue;
        }
    }

    fclose(in);
    return 0;
}

The above allocates 1024 pages using huge pages, if possible. (On x86-64, one huge page is 2 MiB or 512 normal pages, so this should allocate two huge pages' worth, or 4 MiB, of private anonymous memory. Adjust the PAGES constant if you run on a different architecture.)

Make sure huge pages are enabled by verifying /proc/sys/vm/nr_hugepages is greater than zero. On most systems it defaults to zero, so you need to raise it, for example using

sudo sh -c 'echo 10 > /proc/sys/vm/nr_hugepages'

which tells the kernel to keep a pool of 10 huge pages (20 MiB on x86-64) available.

Compile and run the above program,

gcc -W -Wall -O3 huge.c -o huge && ./huge

and you will obtain an abbreviated /proc/PID/smaps output. On my machine, the interesting part contains

2aaaaac00000-2aaaab000000 rw-p 00000000 00:0c 21613022   /anon_hugepage (deleted)
Size:               4096 kB
AnonHugePages:         0 kB
KernelPageSize:     2048 kB
MMUPageSize:        2048 kB

which obviously differs from the typical parts, e.g.

01830000-01851000 rw-p 00000000 00:00 0   [heap]
Size:                132 kB
AnonHugePages:         0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB

The exact format of the complete /proc/self/smaps file is described in man 5 proc, and is quite straightforward to parse. Note that this is a pseudofile generated by the kernel, so it is never localized; the whitespace characters are HT (code 9) and SP (code 32), and newline is LF (code 10).
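For illustration, a minimal parser for one such property line could look like this (the function name is mine, not anything standard):

```c
#include <string.h>

/* Parse a "KernelPageSize:     2048 kB" line from /proc/self/smaps.
   Returns the page size in bytes, or 0 if the line is some other field. */
static size_t parse_kernel_pagesize(const char *line)
{
    static const char key[] = "KernelPageSize:";
    size_t kb = 0;

    if (strncmp(line, key, sizeof key - 1) != 0)
        return 0;
    line += sizeof key - 1;

    /* Skip the HT/SP padding; the value is always reported in kB. */
    while (*line == ' ' || *line == '\t')
        line++;
    while (*line >= '0' && *line <= '9')
        kb = 10 * kb + (size_t)(*line++ - '0');

    return kb * 1024;
}
```

Because the pseudofile is never localized, this byte-level scan is safe; there is no need for locale-aware parsing.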


My recommended approach would be to maintain a structure describing the mappings, for example

struct region {
    size_t  start;    /* first byte of region at (void *)start */
    size_t  length;   /* last byte at (void *)(start + length - 1) */
    size_t  pagesize; /* KernelPageSize field */
};

struct maps {
    size_t           length;   /* of /proc/self/smaps */
    unsigned long    hash;     /* fast hash, say DJB XOR */
    size_t           count;    /* number of regions */
    pthread_rwlock_t lock;     /* region array lock */
    struct region   *region;
};

where the lock member is only needed if it is possible that one thread examines the region array while another thread is updating or replacing it.

The idea is that at desired intervals, the /proc/self/smaps pseudofile is read, and a fast, simple hash (or CRC) is calculated. If the length and the hash match, then assume mappings have not changed, and reuse the existing information. Otherwise, the write lock is taken (remember, the information is already stale), the mapping information parsed, and a new region array is generated.

If multithreaded, the lock member allows multiple concurrent readers, but protects against using a discarded region array.

Note: When calculating the hash, you can also calculate the number of map entries, as property lines all begin with an uppercase ASCII letter (A-Z, codes 65 to 90). In other words, the number of lines that begin with a hexadecimal digit (0-9, codes 48 to 57, or lowercase a-f, codes 97 to 102) is the number of memory regions described.
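Both the hash and the region count can come out of a single pass over the raw data. A sketch using a DJB-style XOR hash (the function name is mine):

```c
#include <stddef.h>

/* One pass over the raw smaps data: DJB XOR hash, plus the number of
   memory regions, i.e. lines starting with a lowercase hex digit. */
static unsigned long smaps_hash(const char *data, size_t len, size_t *count)
{
    unsigned long hash = 5381UL;
    size_t regions = 0;
    int at_line_start = 1;
    size_t i;

    for (i = 0; i < len; i++) {
        const unsigned char c = (unsigned char)data[i];

        hash = (hash * 33UL) ^ (unsigned long)c;

        if (at_line_start &&
            ((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')))
            regions++;

        at_line_start = (c == '\n');
    }

    if (count)
        *count = regions;
    return hash;
}
```

Knowing the region count before parsing lets you allocate the new region array in one go instead of growing it as you parse.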


Of the functions provided by the C library, mmap(), munmap(), mremap(), madvise() (and posix_madvise()), mprotect(), malloc(), calloc(), realloc(), free(), brk(), and sbrk() may change the memory mappings (although I'm not certain this list contains them all). These library calls can be interposed, and the memory region list updated after each (successful) call. This should allow an application to rely on the memory region structures for accurate information.

Personally, I would create this facility as a preload library (loaded using LD_PRELOAD). That allows easily interposing the above functions with just a few lines of code: the interposed function calls the original function, and if successful, calls an internal function that reloads the memory region information from /proc/self/smaps. Care should be taken to call the original memory management functions, and to keep errno unchanged; otherwise it should be quite straightforward. I personally would also avoid using library functions (including string.h) to parse the fields, but I am overly careful anyway.

The interposed library would obviously also provide the function to query the page size at a specific address, say pagesizeat(). (If your application exports a weak version that always returns -1 with errno==ENOTSUP, your preload library can override it, and you don't need to worry about whether the preload library is loaded or not -- if not, the function will just return an error.)
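The weak default in the application could be as simple as this (pagesizeat() being the hypothetical name suggested above):

```c
#include <errno.h>
#include <sys/types.h>

/* Weak fallback: when the preload library is loaded, its strong
   pagesizeat() definition overrides this one. */
__attribute__((weak)) ssize_t pagesizeat(const void *addr)
{
    (void)addr;       /* unused in the fallback */
    errno = ENOTSUP;
    return -1;
}
```

The application can then call pagesizeat() unconditionally and simply treat -1/ENOTSUP as "information not available".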

Questions?

Nominal Animal
  • Thanks. A solution of maintaining smaps parsed is not what I need. It is not simple, and requires allocating and maintaining a DB. My final solution is to actually work around this issue. Thanks anyway. – Eldad Jan 28 '14 at 15:52
  • @Eldad: Sounds sensible. Like I said in my answer, it is very rare to actually need to know the page size. Even for the performance benefits, just giving the kernel hints (with graceful fall-back to normal pages) is almost always enough. Then again, you only asked about how to get the page size; why would you expect anyone to know what your actual problem and context was? Instead of asking help with the solution you've decided on, you probably should have described and asked about your actual, original problem. – Nominal Animal Jan 28 '14 at 16:22
  • I'm sorry I wasn't fully clear, I don't want a solution which involves reading and parsing 'smaps' because it is not efficient. – Eldad Jan 29 '14 at 17:27
  • @Eldad: There is no existing syscall, or any other way to deduce the kernel page size of an arbitrary address belonging to the process. This is the only currently available solution in current Linux kernels. The only other option you have is to create a Linux kernel module that provides the information necessary. A character device driver exporting a simple `ioctl()` for this should not be too difficult. The tricky part is walking the kernel data structures correctly, to find the necessary information. But that is a totally different question, and even that might be too slow for your needs. – Nominal Animal Jan 29 '14 at 19:18
  • Thanks. Indeed, it looks like one will have to go deep into vma to get that info. This is not the O(1) solution I need, so I worked around the entire problem. Thanks anyway. – Eldad Jan 31 '14 at 08:08