
We received a large patch that modifies about 17,000 files and is 5.2 GB in size. Applying it with `git apply -3` had not finished after 12 hours.

We split the patch into smaller patches per file and applied them one by one, so that at least we could see the progress.
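The question does not say how the split was done. A minimal sketch, assuming the input is a git-style patch in which every per-file section starts with a `diff --git ` line; the function name `split_patch_per_file` and the numbered output file names are hypothetical:

```shell
# Hypothetical helper: split a git-style patch into one patch per file.
# Usage: split_patch_per_file all_in_one.patch out_dir
split_patch_per_file() {
    patch_file=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    awk -v dir="$out_dir" '
        # Each per-file section of a git-style patch starts with this header.
        /^diff --git / {
            if (out != "") close(out)   # avoid running out of open files
            n++
            out = sprintf("%s/%05d.patch", dir, n)
        }
        # Lines before the first header (e.g. a commit message) are skipped.
        n > 0 { print > out }
    ' "$patch_file"
}
```

Splitting like this is only safe when the per-file changes are independent of each other; a patch that renames or swaps files may become invalid once its parts are applied sequentially.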

It got stuck again on one of the per-file patches, which is still 111 MB and modifies an HTML file.

We split this file patch into smaller patches per chunk and got about 57,000 chunk patches. Each chunk patch takes around 2-3 seconds, so applying them all would take longer than applying the file patch itself. I'll try putting more chunks into each patch.

Is there any method to efficiently apply such large patches? Thanks.

Update:

As @ti7 suggested, I tried `patch`, and it solved the problem.

In my case, we have 2 kinds of large patches.

One is adding/removing a large binary and the content of the binary is contained as text in the patch. One of the binaries is 188M and the patch size that removes it is 374M.

The other is modifying a large text and has millions of deletions and insertions. One of the text files is 70M before and 162M after. The patch size is 181M and has 2388623 insertions and 426959 deletions.

After some tests, I think what makes a patch "large" here is the number of insertions and deletions, not its size in bytes.
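To see which per-file patches are "large" in this sense before applying them, the insertions and deletions can be counted straight from the patch text. A rough sketch, assuming unified-diff hunks with `@@ -start,len +start,len @@` headers; the function name `count_patch_changes` is hypothetical, and binary sections (which have no hunks) count as zero:

```shell
# Hypothetical helper: print "<insertions> <deletions>" for a unified diff.
count_patch_changes() {
    awk '
        # Hunk header "@@ -start,len +start,len @@"; len is 1 when omitted.
        /^@@ -/ {
            old = (split($2, o, ",") > 1) ? o[2] + 0 : 1
            new = (split($3, n, ",") > 1) ? n[2] + 0 : 1
            next
        }
        # Outside a hunk (file headers, commit text): nothing to count.
        old <= 0 && new <= 0 { next }
        /^\+/ { ins++; new--; next }
        /^-/  { del++; old--; next }
        /^\\/ { next }                 # "\ No newline at end of file"
        { old--; new-- }               # context line
        END { printf "%d %d\n", ins, del }
    ' "$1"
}
```

Tracking the hunk lengths keeps file headers such as `--- a/file`, and content lines that merely look like them, from being miscounted.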

For the binary patch,

  • git apply -3, 7 seconds
  • git apply, 6 seconds
  • patch, 5 seconds

For the text patch,

  • git apply -3, stuck, not finished after 10 minutes
  • git apply, stuck, not finished after 10 minutes
  • patch, 3 seconds

The binary patch has only 1 insertion and/or 1 deletion, so `git apply` and `patch` both finish in seconds. Either is acceptable.

The text patch has a huge number of insertions and deletions, and `patch` is clearly much faster in this case. I read some posts on `patch` and learned that some versions cannot handle adding, removing, or renaming a file. Luckily, the `patch` on my machine works well.

So we split the all-in-one patch into smaller patches per file. We try `timeout 10s git apply -3 file_patch` first. If it does not finish in 10 seconds, we try `timeout 10s patch -p1 < file_patch`.
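The fallback described above could be sketched as the following shell function. This is an illustration, not the exact script we run: the function name `apply_file_patch` is hypothetical, it must be called from the top of the working tree, and it falls back to `patch` on any `git apply` failure, not only on a timeout.

```shell
# Hypothetical helper: apply one per-file patch, preferring git apply.
# Usage (from the top of the working tree): apply_file_patch file_patch
apply_file_patch() {
    patch_file=$1
    # First attempt: three-way git apply, abandoned after 10 seconds.
    if timeout 10s git apply -3 "$patch_file"; then
        echo "applied with git apply: $patch_file"
        return 0
    fi
    # Fallback: plain patch, which reads the patch from stdin.
    if timeout 10s patch -p1 < "$patch_file"; then
        echo "applied with patch: $patch_file"
        return 0
    fi
    echo "failed to apply: $patch_file" >&2
    return 1
}
```

One caveat with this design: when `git apply -3` fails because of a conflict rather than a timeout, it may leave conflict markers in the working tree, so the `patch` fallback would then run against an already modified file. Checking the exit status (`timeout` returns 124 when the time limit is hit) would let the two cases be handled separately.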

In the end, it took about an hour and a half to apply all 17,000 patches. That is much better than applying the all-in-one patch and getting stuck for 12 hours with nothing done.

I also tried `patch -p1 < all_in_one_patch`. It took only 1m27s, so I think we can improve our patch flow further.

ElpieKay
  • I've never worked with a git repo that is larger than tens of megabytes. I mean the entire repo, including all history from the project's inception. I can't even imagine a change set that is 5.2 GB. Did someone commit some large binary files? – Code-Apprentice Apr 21 '22 at 03:48
  • @Code-Apprentice In my case, large binary files are not the trouble. A binary file only has one chunk. It would fail or succeed quickly. The problem is the patch has too many files and some of the text files have too many chunks. – ElpieKay Apr 21 '22 at 03:52
  • you may be able to use [`patch`](https://man7.org/linux/man-pages/man1/patch.1.html) instead of `git apply` and then add and commit afterwards – ti7 Apr 21 '22 at 03:52
  • it could also be that there's some outside issue, like a special filesystem which is doing a lot of work - how is the system load? if you have enough memory, allocating a tmpfs to work in may be much faster – ti7 Apr 21 '22 at 03:54
  • @ti7 thanks for your suggestion. I'll try `patch`. The system load is heavy. A lot of jobs are running on it. Applying the patches is one of the automated jobs. But it got stuck this time. – ElpieKay Apr 21 '22 at 03:58
  • As a bit of background, `git apply` attempts to apply the entire patch in memory, before it starts to write out the modified files. The intent is that it does not leave behind a partially modified worktree in case that a patch fails half way through. – j6t Apr 21 '22 at 05:45
  • @ti7 I tried `patch`. It applied the 111M file patch very quickly, in just 2 seconds. Would you please write it as an answer so that I can accept it? – ElpieKay Apr 21 '22 at 06:55
  • Be careful with splitting a patch: as the `git diff` man page says: **"All the file1 files in the output refer to files before the commit, and all the file2 files refer to files after the commit. It is incorrect to apply each change to each file sequentially. For example, this patch will swap a and b [...]".** So splitting a patch in 2 patches applied sequentially might make it invalid. – Gabriel Devillers Jun 21 '23 at 16:30

2 Answers


You may be able to use `patch` instead of `git apply` to speed up patching!

To my knowledge, `patch` directly spools out a new file line by line, splicing in the changes as it goes, while `git apply` does additional context checking. As @j6t notes in a comment (though I haven't confirmed it), `git apply` will also attempt to load and patch everything in memory before writing anything out.

ti7
  • Thanks! I used `cd path_to_repository; patch -p1 < path_to_patch`. – ElpieKay Apr 21 '22 at 08:19
  • Note that `patch` does not match all features of `git apply`, for instance it does [not understand](https://stackoverflow.com/questions/50677861/git-binary-diffs-are-not-supported-error-using-yocto) `GIT binary patch`es. – Gabriel Devillers Jun 22 '23 at 15:36

Another argument for `patch`: `git apply` is now officially limited to about 1 GiB.

With Git 2.39 (Q4 2022), `git apply` limits its input to a bit less than 1 GiB.

See commit f1c0e39 (25 Oct 2022) by Taylor Blau (ttaylorr).
(Merged by Taylor Blau -- ttaylorr -- in commit c41ec63, 30 Oct 2022)

apply: reject patches larger than ~1 GiB

Reported-by: 정재우
Suggested-by: Johannes Schindelin
Signed-off-by: Taylor Blau

The apply code is not prepared to handle extremely large files.
It uses "int" in some places, and "unsigned long" in others.

This combination leads to unfortunate problems when switching between the two types.
Using "int" prevents us from handling large files, since large offsets will wrap around and spill into small negative values, which can result in wrong behavior (like accessing the patch buffer with a negative offset).

Converting from "unsigned long" to "int" also has truncation problems even on LLP64 platforms where "long" is the same size as "int", since the former is unsigned but the latter is not.

To avoid potential overflow and truncation issues in `git apply`, apply similar treatment as in dcd1742 ("xdiff: reject files larger than ~1GB", 2015-09-24, Git v2.7.0-rc0 -- merge listed in batch #2), where the xdiff code was taught to reject large files for similar reasons.

The maximum size was chosen somewhat arbitrarily, but picking a value just shy of a gigabyte allows us to double it without overflowing 2^31-1 (after which point our value would wrap around to a negative number).
To give ourselves a bit of extra margin, the maximum patch size is a MiB smaller than a full GiB, which gives us some slop in case we allocate "(records + 1) * sizeof(int)" or similar.

Luckily, the security implications of these conversion issues are relatively uninteresting, because a victim needs to be convinced to apply a malicious patch.
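Given that limit, a job could check the patch size up front and go straight to `patch` when `git apply` would reject the input. A minimal sketch: the 1 GiB minus 1 MiB figure comes from the commit message above, while the function name, the optional second argument, and treating the limit as inclusive are assumptions.

```shell
# Assumed limit: 1 GiB minus 1 MiB, per the commit message quoted above.
GIT_APPLY_LIMIT=$((1024 * 1024 * 1023))

# Hypothetical helper: succeeds (exit 0) when the patch is too large.
# Usage: patch_exceeds_git_apply_limit patch_file [limit_in_bytes]
patch_exceeds_git_apply_limit() {
    limit=${2:-$GIT_APPLY_LIMIT}
    # wc -c may pad its output with spaces; $(( )) normalizes the number.
    size=$(($(wc -c < "$1")))
    [ "$size" -ge "$limit" ]
}
```

For example, `patch_exceeds_git_apply_limit all_in_one_patch && patch -p1 < all_in_one_patch` would skip `git apply` for oversized input.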


As noted by Gabriel Devillers in the comments:

I tried to apply a patch of size 1.6 GB with Git 2.41 and got this error:

git apply: failed to read: No such file or directory 

which is totally unclear.


With Git 2.42 (Q3 2023), `git apply` punts when it is fed too large a patch input; the error message it gives when that happens has been clarified.

See commit 42612e1 (26 Jun 2023) by Phillip Wood (phillipwood).
(Merged by Junio C Hamano -- gitster -- in commit 84b889b, 06 Jul 2023)

apply: improve error messages when reading patch

Reported-by: Premek Vysoky
Signed-off-by: Phillip Wood

Commit f1c0e39 ("apply: reject patches larger than ~1 GiB", 2022-10-25, Git v2.39.0-rc0 -- merge listed in batch #9) added a limit on the size of patch that apply will process to avoid integer overflows.
The implementation re-used the existing error message for when we are unable to read the patch.
This is unfortunate because (a) it does not signal to the user that the patch is being rejected because it is too large and (b) it uses error_errno() without setting errno.

This patch adds a specific error message for the case when a patch is too large.
It also updates the existing message to make it clearer that it is the patch that cannot be read rather than any other file and marks both messages for translation.
The "git apply" prefix is also dropped to match most of the rest of the error messages in apply.c (there are still a few error messages that prefixed with "git apply" and are not marked for translation after this patch).
The test added in f1c0e39 is updated accordingly.

VonC
  • I tried to apply a patch of size 1.6 GB with Git 2.41 and got `error: git apply: failed to read: No such file or directory` which is totally unclear. It is unfortunate that they do not explicitly give the reason because your answer is currently not found when searching this error online. – Gabriel Devillers Jun 21 '23 at 15:55
  • @GabrielDevillers Good point. I have included your comment in the answer for more visibility. – VonC Jun 21 '23 at 19:43