This is inelegant, so I'd be interested in better ideas, but here's a summary of what I've been able to make basically work.
This is a two-pronged approach: a small bootstrap payload inserted into padding in the original ELF, which then mmap()s an arbitrarily large binary blob to do the actual work.
Part I: the bootstrap payload
Basically, I insert a small amount of code into some padding between the .ARM.exidx section (which loads at the top of the code segment) and the .preinit_array section. This code just opens another binary blob and mmap()s it read-only and executable at a hard-coded virtual address that I hope is safe.
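The real bootstrap is a few hand-written instructions issuing raw syscalls, but the logic is easier to see in C. This is only a sketch of the idea: map_payload and its parameters are names I made up for illustration, and it uses the libc wrappers, which the actual stub cannot.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes of the file at `path` read-only and executable at exactly
 * `addr` (which must be page-aligned). MAP_FIXED silently replaces anything
 * already mapped in that range, so the address has to be known to be unused.
 * Returns `addr` on success or MAP_FAILED on error. */
static void *map_payload(const char *path, void *addr, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return MAP_FAILED;

    void *p = mmap(addr, len, PROT_READ | PROT_EXEC,
                   MAP_PRIVATE | MAP_FIXED, fd, 0);

    close(fd); /* the mapping remains valid after the descriptor is closed */
    return p;
}
```

The MAP_FIXED flag is what makes the "I hope it's safe" part matter: the kernel will not refuse an address that overlaps an existing mapping, it will just replace it.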
In order to get my inserted code to load as part of the main executable, I had to enlarge the load segment in the ELF file. In this case it's the second phdr struct, which starts at 0x54. I changed both p_filesz at 0x64 (0x54+0x10) and p_memsz at 0x68 (0x54+0x14).
I also changed the e_entry start address in the ELF header, at offset 0x18, to point to my inserted code. My inserted code jumps to the old start address when it's done setting things up (actually it first jumps to the stage-two setup in the larger payload, which then jumps to the original entry point).
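I did these header edits by hand, but the same patch can be sketched in C using the struct definitions from <elf.h>. The function name patch_elf32 and its parameters are hypothetical, and the sketch assumes a 32-bit ELF whose byte order matches the host (little-endian ARM patched on a little-endian machine).

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The raw file offsets follow directly from the ELF32 struct layout. */
static_assert(offsetof(Elf32_Ehdr, e_entry) == 0x18, "e_entry offset");
static_assert(sizeof(Elf32_Ehdr) + sizeof(Elf32_Phdr) == 0x54, "second phdr");
static_assert(offsetof(Elf32_Phdr, p_filesz) == 0x10, "p_filesz offset");
static_assert(offsetof(Elf32_Phdr, p_memsz) == 0x14, "p_memsz offset");

/* Grow the PT_LOAD program header at index `idx` by `extra` bytes and
 * redirect the entry point to `new_entry`. `image` holds the whole ELF32
 * file in memory. Returns the old entry address (the inserted code has to
 * jump there when it is done), or 0 if the image does not look right. */
static uint32_t patch_elf32(uint8_t *image, size_t size, unsigned idx,
                            uint32_t extra, uint32_t new_entry)
{
    Elf32_Ehdr eh;
    Elf32_Phdr ph;

    if (size < sizeof eh)
        return 0;
    memcpy(&eh, image, sizeof eh);
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS32)
        return 0;
    if (idx >= eh.e_phnum || eh.e_phentsize != sizeof ph)
        return 0;

    size_t off = eh.e_phoff + (size_t)idx * sizeof ph;
    if (off + sizeof ph > size)
        return 0;
    memcpy(&ph, image + off, sizeof ph);
    if (ph.p_type != PT_LOAD)
        return 0;

    ph.p_filesz += extra; /* bytes read from the file */
    ph.p_memsz += extra;  /* bytes occupied in memory */

    uint32_t old_entry = eh.e_entry;
    eh.e_entry = new_entry;

    memcpy(image + off, &ph, sizeof ph);
    memcpy(image, &eh, sizeof eh);
    return old_entry;
}
```

For my binary the call would use idx 1 (the second phdr), with extra set to the size of the inserted stub and new_entry set to the virtual address the stub ends up at.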
Finally, I changed the statically linked syscall stubs for the functions I wanted to trap so that they point to my replacements at the load address of the larger payload I'm mmap()ing in.
Part II: the big payload
This implements whatever modifications are being made. In my case, that means replacing syscalls with functions that log when certain conditions are met. Since the main executable is statically linked, this would have to be as well, or more simply, it can't use the C library. Instead it uses assembly language to issue syscalls for basic I/O. I realized that, since the blob isn't loaded as an executable, it has no writable data segment and so no persistent variable storage. So on startup I mmap() an anonymous page to hold my variables: mostly the fd of the file I'm logging to and the fds of the device drivers whose operations should be logged.
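Here is roughly what that scratch page looks like, as a sketch. The struct layout, the names, and the limit of 8 watched fds are all invented for illustration, and the real payload issues mmap as a raw syscall rather than through libc. Passing a fixed address is one way for the replacement functions to find the page again without any global pointer; that part is an assumption of this sketch.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>

#define MAX_WATCHED 8 /* arbitrary limit for this sketch */

/* Hypothetical layout of the payload's only writable storage. */
struct payload_state {
    int log_fd;                   /* fd of the log file */
    int n_watched;                /* number of entries used below */
    int watched_fd[MAX_WATCHED];  /* device fds whose operations get logged */
};

/* Allocate one zero-filled read/write page for the state. If `addr` is
 * non-NULL the page is placed exactly there with MAP_FIXED; otherwise the
 * kernel picks the address. Returns NULL on failure. */
static struct payload_state *alloc_state(void *addr)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS | (addr ? MAP_FIXED : 0);
    void *p = mmap(addr, 4096, PROT_READ | PROT_WRITE, flags, -1, 0);
    return p == MAP_FAILED ? NULL : (struct payload_state *)p;
}

/* Remember a device fd; returns 0 on success, -1 if the table is full. */
static int watch_fd(struct payload_state *s, int fd)
{
    if (s->n_watched >= MAX_WATCHED)
        return -1;
    s->watched_fd[s->n_watched++] = fd;
    return 0;
}

/* Returns 1 if operations on `fd` should be logged, 0 otherwise. */
static int is_watched(const struct payload_state *s, int fd)
{
    for (int i = 0; i < s->n_watched; i++)
        if (s->watched_fd[i] == fd)
            return 1;
    return 0;
}
```

A replacement for open() would call watch_fd() when the path matches the device of interest, and the replacements for ioctl(), read(), and so on would check is_watched() before writing to log_fd.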
Compiling this part is a bit inelegant. I compile to assembly with gcc's -S switch, then remove all the section directives. I then pass that back through gcc to assemble it and generate an object file. I run this through the linker, specifying the name of my first function as the entry point (-e) and using a customized version of the usual linker script which removes the 0x8000 start offset. But there's still some offset due to headers, in this case 128 bytes. To preserve the fixups, I objcopy the contents of the linked ELF out into a binary blob, dd myself 128 bytes from /dev/zero, and cat that onto the beginning....
As I was saying... this is inelegant, so I'm open to better ideas.