Say that two different processes each open a different file. Normally, each file would have its own inode, and each inode would have its own struct address_space (this is the structure that remembers where the file's page cache pages are in memory).
But let's say I know that these files are initially identical. I want to come up with a way to share cached pages between them to the extent possible.
I was considering these strategies:
- Add a new field to struct address_space: a pointer to a "parent". Then, whenever I look for an existing page, I'll also look in the parent (if it exists). Whenever I write to a page, I will therefore need to fault and C-O-W the page into the main address_space. Both files will share the common parent.
- Group each related set of struct address_space in a linked list. Whenever I look for an existing page, search the entire linked list. In this scenario, though, it would be disallowed to "find" a dirty page on a friend's address_space. In other words, if a page gets dirty it can't be used as a backup anymore. In this scenario, if anyone ever wrote data to the file, I would need to disassociate the address_spaces. I would also need some sort of C-O-W behavior to sustain this as well.
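To make the first strategy concrete, here is a rough userspace sketch of the lookup order and the copy-on-write step. Everything in it (toy_mapping, toy_find_page, toy_write_page, NPAGES, PAGE_SZ) is a made-up stand-in for illustration, not kernel API, and it assumes no locking, no reference counting, and no writeback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define NPAGES  4    /* pages per toy file (assumption) */
#define PAGE_SZ 16   /* toy page size in bytes (assumption) */

/* Toy stand-in for struct address_space; not the kernel's definition. */
struct toy_mapping {
    struct toy_mapping *parent;  /* the hypothetical new "parent" field */
    char *pages[NPAGES];         /* NULL means "not cached in this mapping" */
};

/* Lookup: check our own cache first, then fall back to the parent. */
static const char *toy_find_page(const struct toy_mapping *m, size_t idx)
{
    assert(idx < NPAGES);
    if (m->pages[idx])
        return m->pages[idx];
    if (m->parent)
        return m->parent->pages[idx];
    return NULL;
}

/*
 * Write: if the page only exists in the parent, copy it into this
 * mapping first (the C-O-W step), then modify the private copy.
 * The parent's page is never touched.
 */
static bool toy_write_page(struct toy_mapping *m, size_t idx,
                           size_t off, const char *buf, size_t len)
{
    assert(idx < NPAGES);
    assert(off + len <= PAGE_SZ);

    if (!m->pages[idx]) {
        char *copy = calloc(1, PAGE_SZ);
        if (!copy)
            return false;
        if (m->parent && m->parent->pages[idx])
            memcpy(copy, m->parent->pages[idx], PAGE_SZ);
        m->pages[idx] = copy;
    }
    memcpy(m->pages[idx] + off, buf, len);
    return true;
}
```

After a write, the writer sees its private copy while every other child still resolves to the parent's page. One thing this toy model glosses over: in the real kernel a page cache page records a single owning mapping, so a page that is reachable through a parent would need extra bookkeeping wherever code maps from a page back to its address_space.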
Can anyone tell me:
- Are either or both of these ideas sound?
- What things in particular should I watch out for?
As a point of reference, I am doing a custom kernel hack to save memory because on my system there are multiple identical files being opened (but not the same inode = not sharing pagecache).
EDIT: 3rd idea:
- Keep a linked list of the "related" page cache address_space structs, and every time we read from disk, update every address_space struct that's open. Opening a new related file would have to trigger a big copy of the existing page cache entries into the new address_space, skipping any dirty pages.
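A rough userspace sketch of this third idea, in the same spirit: a read from disk is published to every related mapping, and a newly opened file copies over only the clean entries. The names (toy_slot, toy_mapping, toy_publish_read, toy_join) are hypothetical, and the model assumes a new mapping starts out zeroed and ignores locking and page lifetime entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 4  /* pages per toy file (assumption) */

/* One cache slot: a pointer to page data plus a dirty flag. */
struct toy_slot {
    const char *page;  /* NULL means "not cached" */
    bool dirty;        /* set once this mapping has modified its copy */
};

/* Toy stand-in for struct address_space, linked to its relatives. */
struct toy_mapping {
    struct toy_mapping *next;       /* hypothetical "related" list link */
    struct toy_slot slots[NPAGES];
};

/*
 * Called after a page has been read from disk: hand the same page to
 * every related mapping that does not already cache that index.
 * Mappings that already hold a (possibly dirty) copy are left alone.
 */
static void toy_publish_read(struct toy_mapping *head, size_t idx,
                             const char *page)
{
    assert(idx < NPAGES);
    for (struct toy_mapping *m = head; m; m = m->next) {
        if (!m->slots[idx].page) {
            m->slots[idx].page = page;
            m->slots[idx].dirty = false;
        }
    }
}

/*
 * Opening a new related file: for each index, copy the first clean
 * entry found among the existing relatives, skipping dirty ones,
 * then link the new mapping at the head of the list.
 */
static void toy_join(struct toy_mapping **head, struct toy_mapping *newm)
{
    for (size_t idx = 0; idx < NPAGES; idx++) {
        for (struct toy_mapping *m = *head; m; m = m->next) {
            if (m->slots[idx].page && !m->slots[idx].dirty) {
                newm->slots[idx] = m->slots[idx];
                break;
            }
        }
    }
    newm->next = *head;
    *head = newm;
}
```

The trade-off this makes visible: lookups stay as cheap as they are today because each mapping has its own entries, but every disk read and every open now costs work proportional to the number of related mappings.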