
Is there any way in Linux, using C, to generate a diff/patch of two files stored in memory, using a common format (i.e. unified diff, like with the command-line diff utility)?

I'm working on a system where I generate two text files in memory, and no external storage is available, or desired. I need to create a line-by-line diff of the two files, and since they are mmap'ed, they don't have file names, preventing me from simply calling system("diff file1.txt file2.txt").

I have file descriptors (fds) available for use, and that's my only entry point to the data. Is there any way to generate a diff/patch by comparing the two open files? If the implementation is MIT/BSD licensed (ie: non-GPL), so much the better.

Thank you.

Barry Michael Doyle
Cloud
  • I can't find the way to call `diff` with 2 standard input arguments as files, but that would be a way to do it. – Jean-François Fabre Feb 21 '17 at 20:44
  • another way would be to use `comm - -` and feed lines alternatively but it only works if the files are synchronized. – Jean-François Fabre Feb 21 '17 at 20:47
  • I know you said "no external storage is available, or desired". But have you considered using a simple in-memory filesystem like ramfs? It comes on basically all linux distros and AFAIK its main use is to throw up a temporary filesystem during early boot. – Mitchell Gouzenko Feb 21 '17 at 20:58
  • 1
    Possible duplicate http://stackoverflow.com/questions/1451694/is-there-a-way-to-diff-files-from-c – 0andriy Feb 21 '17 at 21:23
  • @MitchellGouzenko Part of my requirement is that no-one else is able to access the files, save for the `root` user attaching to the process via GDB to forcibly get access to the data. – Cloud Feb 21 '17 at 21:36
  • @MitchellGouzenko If I were somehow able to create a private ram-disk on a per-process basis, that would be acceptable, but I don't see any way to do it in pure C without using non-portable `system(" ... ")` calls. – Cloud Feb 21 '17 at 21:47

2 Answers


On Linux you can use the /dev/fd/ pseudo-filesystem (a symbolic link to /proc/self/fd). Construct a path for each file descriptor with snprintf(), e.g. snprintf(path1, PATH_MAX, "/dev/fd/%d", fd1); and likewise for fd2, then run diff on the two paths.
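A minimal sketch of that idea, assuming Linux, a `diff` binary on the PATH, and descriptors that are not marked close-on-exec. The helper name `diff_fds` is hypothetical, and the sketch uses fork() and execlp() rather than system() so that no shell is involved; diff's output goes to the caller's standard output.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run "diff -u" on two open descriptors.
 * Returns diff's exit status (0 = identical, 1 = different, 2 = trouble),
 * 127 if diff could not be executed, or -1 on fork/wait failure. */
int diff_fds(int fd1, int fd2)
{
    char path1[64], path2[64];

    assert(fd1 >= 0 && fd2 >= 0);
    snprintf(path1, sizeof path1, "/dev/fd/%d", fd1);
    snprintf(path2, sizeof path2, "/dev/fd/%d", fd2);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: the descriptors are inherited, so the paths stay valid. */
        execlp("diff", "diff", "-u", path1, path2, (char *)NULL);
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

To capture the output instead of printing it, the child's standard output would have to be redirected to a pipe before the exec.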

Ricardo Branco

Considering the requirements, the best option would be to implement your own in-memory diff -au. You could perhaps adapt the relevant parts of OpenBSD's diff to your needs.
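As a rough illustration of what the core of a home-grown line diff could look like, here is a sketch based on a longest-common-subsequence table. This is not OpenBSD's algorithm and it does not produce unified-diff hunks or headers; it only prints each line prefixed with ' ', '-' or '+'. The function name `line_diff` and its interface (lines already split into NUL-terminated strings) are assumptions made for the example, and the table needs O(n*m) memory, so it only suits small inputs.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: print a line-by-line diff of a[0..n) and b[0..m).
 * Returns the number of added plus removed lines, or -1 if out of memory. */
int line_diff(const char *const *a, size_t n,
              const char *const *b, size_t m, FILE *out)
{
    size_t w = m + 1;
    assert(out != NULL);

    /* lcs[i*w + j] = length of the LCS of a[i..n) and b[j..m). */
    size_t *lcs = calloc((n + 1) * w, sizeof *lcs);
    if (!lcs)
        return -1;
    for (size_t i = n; i-- > 0; ) {
        for (size_t j = m; j-- > 0; ) {
            size_t down = lcs[(i + 1) * w + j];
            size_t right = lcs[i * w + j + 1];
            if (strcmp(a[i], b[j]) == 0)
                lcs[i * w + j] = lcs[(i + 1) * w + j + 1] + 1;
            else
                lcs[i * w + j] = down >= right ? down : right;
        }
    }

    /* Walk the table from the top-left corner, emitting one line per step. */
    int changes = 0;
    size_t i = 0, j = 0;
    while (i < n || j < m) {
        if (i < n && j < m && strcmp(a[i], b[j]) == 0) {
            fprintf(out, " %s\n", a[i]);
            i++;
            j++;
        } else if (i < n && (j == m || lcs[(i + 1) * w + j] >= lcs[i * w + j + 1])) {
            fprintf(out, "-%s\n", a[i++]);
            changes++;
        } else {
            fprintf(out, "+%s\n", b[j++]);
            changes++;
        }
    }
    free(lcs);
    return changes;
}
```

Passing a stream obtained from POSIX open_memstream() as `out` would keep the result in memory as well.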


Here's an outline of how you can use the /usr/bin/diff command via pipes to obtain the unified diff between two strings stored in memory:

  1. Create three pipes: I1, I2, and O.

  2. Fork a child process.

  3. In the child process:

    1. Move the read ends of pipes I1 and I2 to descriptors 3 and 4, and the write end of pipe O to descriptor 1.

    2. Close the other ends of those pipes in the child process. Open descriptor 0 for reading from /dev/null, and descriptor 2 for writing to /dev/null.

    3. Execute execl("/usr/bin/diff", "diff", "-au", "/proc/self/fd/3", "/proc/self/fd/4", (char *)NULL);

      This executes the diff binary in the child process. It will read the inputs from the two pipes, I1 and I2, and output the differences to pipe O.

  4. The parent process closes the read ends of the I1 and I2 pipes, and the write end of the O pipe.

  5. The parent process writes the comparison data to the write ends of I1 and I2 pipes, and reads the differences from the read end of the O pipe.

    Note that the parent process must use select() or poll() or a similar method (preferably with nonblocking descriptors) to avoid deadlock. (Deadlock occurs if both parent and child try to read at the same time, or write at the same time.) Typically, the parent process must avoid blocking at all costs, because that is likely to lead to a deadlock.

    When the input data has been completely written, the parent process must close the respective write end of the pipe, so that the child process detects the end-of-input. (Unless an error occurs, the write ends must be closed before the child process closes its end of the O pipe.)

    When the parent process notices that no more data is available in the O pipe (read() returning 0), either it has already closed the write ends of the I1 and I2 pipes, or there was an error. If there is no error, the data transfer is complete, and the child process can be reaped.

  6. The parent process reaps the child using e.g. waitpid(). Note that if there were any differences, diff returns with exit status 1.
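The steps above can be sketched roughly as follows. The helper name `diff_buffers` is hypothetical. For brevity the sketch deviates from the outline in a few labeled ways: it passes the pipes' actual descriptor numbers to diff instead of moving them to 3 and 4, it ignores SIGPIPE process-wide, it leaks descriptors on early error paths, and it assumes descriptors 0, 1 and 2 are already open in the caller.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: unified diff of two in-memory buffers.
 * Returns diff's exit status (0 same, 1 different, 2 trouble, 127 if the
 * exec failed) or -1 on error. On a non-negative return, *out is a
 * malloc()ed, NUL-terminated copy of diff's standard output. */
int diff_buffers(const char *a, size_t alen, const char *b, size_t blen, char **out)
{
    int i1[2], i2[2], o[2];
    assert(a && b && out);
    if (pipe(i1) || pipe(i2) || pipe(o))
        return -1;                       /* sketch: may leak descriptors */
    signal(SIGPIPE, SIG_IGN);            /* sketch: process-wide setting */

    pid_t pid = fork();
    if (pid < 0)
        return -1;                       /* sketch: leaks descriptors */
    if (pid == 0) {
        char p1[32], p2[32];
        snprintf(p1, sizeof p1, "/proc/self/fd/%d", i1[0]);
        snprintf(p2, sizeof p2, "/proc/self/fd/%d", i2[0]);
        close(i1[1]); close(i2[1]); close(o[0]);
        int nul = open("/dev/null", O_RDWR);
        if (nul < 0 || dup2(nul, 0) < 0 || dup2(nul, 2) < 0 || dup2(o[1], 1) < 0)
            _exit(127);
        execl("/usr/bin/diff", "diff", "-au", p1, p2, (char *)NULL);
        _exit(127);
    }

    /* Parent: keep only the write ends of I1/I2 and the read end of O. */
    close(i1[0]); close(i2[0]); close(o[1]);
    struct pollfd p[3] = { { i1[1], POLLOUT, 0 }, { i2[1], POLLOUT, 0 }, { o[0], POLLIN, 0 } };
    const char *src[2] = { a, b };
    size_t left[2] = { alen, blen };
    char *buf = NULL;
    size_t used = 0, cap = 0;
    int ok = 1;
    for (int k = 0; k < 2; k++)
        fcntl(p[k].fd, F_SETFL, fcntl(p[k].fd, F_GETFL) | O_NONBLOCK);

    while (p[2].fd >= 0) {
        /* Close an input pipe once its data is written, so diff sees EOF. */
        for (int k = 0; k < 2; k++)
            if (p[k].fd >= 0 && left[k] == 0) { close(p[k].fd); p[k].fd = -1; }
        if (poll(p, 3, -1) < 0) {
            if (errno == EINTR) continue;
            ok = 0;
            break;
        }
        for (int k = 0; k < 2; k++) {
            if (p[k].fd < 0 || !p[k].revents) continue;
            ssize_t n = write(p[k].fd, src[k], left[k]);
            if (n > 0) { src[k] += n; left[k] -= (size_t)n; }
            else if (errno != EAGAIN && errno != EINTR) left[k] = 0; /* e.g. EPIPE */
        }
        if (p[2].revents) {
            if (cap - used < 4097) {     /* room for one read plus the NUL */
                char *nb = realloc(buf, cap * 2 + 8192);
                if (!nb) { ok = 0; break; }
                buf = nb;
                cap = cap * 2 + 8192;
            }
            ssize_t n = read(p[2].fd, buf + used, 4096);
            if (n > 0) used += (size_t)n;
            else if (n == 0 || errno != EINTR) { close(p[2].fd); p[2].fd = -1; }
        }
    }
    for (int k = 0; k < 3; k++)
        if (p[k].fd >= 0) close(p[k].fd);

    int status;
    while (waitpid(pid, &status, 0) < 0)
        if (errno != EINTR) { free(buf); return -1; }
    if (!ok || !buf || !WIFEXITED(status)) { free(buf); return -1; }
    buf[used] = '\0';
    *out = buf;
    return WEXITSTATUS(status);
}
```

Because both input pipes are written in nonblocking mode and all three descriptors are serviced from one poll() loop, the order in which diff reads its inputs does not matter.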

You can use a fourth pipe to receive the standard error stream from the child process; diff does not normally output anything to standard error.

You can use a fifth pipe, with its write end marked close-on-exec in the child (FD_CLOEXEC set with fcntl(), or O_CLOEXEC passed to pipe2() when the pipe is created), to detect execl() errors. The close-on-exec flag means the descriptor is closed when another binary is executed successfully, so the parent process, having closed its own copy of the write end, can detect that the diff command started by seeing end-of-data on the read end (read() returning 0). If execl() fails, the child can e.g. write the errno value (as a decimal number, or as an int) to this pipe, so that the parent process can read the exact cause of the failure.
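A small sketch of that exec-error pipe on its own, without the data pipes. The helper name `spawn_checked` is hypothetical; it sets FD_CLOEXEC with fcntl() in the child and sends errno back as a raw int.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork and exec `path` with `argv`.
 * Returns 0 once the exec has succeeded (child pid stored in *pid, to be
 * reaped by the caller), or the errno value describing the failure. */
int spawn_checked(const char *path, char *const argv[], pid_t *pid)
{
    int e[2], err = 0;

    assert(path && argv && pid);
    if (pipe(e) < 0)
        return errno;

    pid_t child = fork();
    if (child < 0) {
        err = errno;
        close(e[0]);
        close(e[1]);
        return err;
    }

    if (child == 0) {
        close(e[0]);
        /* Close-on-exec: a successful exec closes e[1] without writing. */
        if (fcntl(e[1], F_SETFD, FD_CLOEXEC) == 0)
            execv(path, argv);
        err = errno;                     /* only reached on failure */
        if (write(e[1], &err, sizeof err) < 0) { /* nothing more to do */ }
        _exit(127);
    }

    close(e[1]);                         /* otherwise read() never sees EOF */
    ssize_t n;
    do {
        n = read(e[0], &err, sizeof err);
    } while (n < 0 && errno == EINTR);
    if (n < 0)
        err = errno;
    else if (n > 0 && (size_t)n != sizeof err)
        err = EIO;
    close(e[0]);

    if (n == 0) {                        /* end-of-data: the exec worked */
        *pid = child;
        return 0;
    }
    while (waitpid(child, NULL, 0) < 0 && errno == EINTR) { }
    return err;
}
```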

In all, the complete method (that both records standard error, and detects exec errors) uses 10 descriptors. This should not be an issue in a normal application, but may be important -- for example, consider an internet-facing server with descriptors used by incoming connections.

Nominal Animal
  • `diff -au <(sort file_1.txt) <(sort file_2.txt)` worked for me (use case was sorting prior to diff, which seems like it maybe ought to be built-in to `diff` but isn't) – ijoseph Aug 02 '22 at 19:57