
I have a very large file (>500GB) that I want to prepend with a relatively small header (<20KB). Commands such as:

cat header bigfile > tmp
mv tmp bigfile

or similar commands (e.g., with sed) are very slow.

What is the fastest method of writing a header to the beginning of an existing large file? I am looking for a solution that can run under CentOS 7.2. It is okay to install packages from CentOS install or updates repo, EPEL, or RPMForge.

It would be great if some method exists that doesn't involve relocating or copying the large amount of data in the bigfile. That is, I'm hoping for a solution that can operate in fixed time for a given header file regardless of the size of the bigfile. If that is too much to ask for, then I'm just asking for the fastest method.

Compiling a helper tool (as in C/C++) or using a scripting language is perfectly acceptable.

Charles Duffy
Steve Amerige
  • Standard file systems treat files as linked lists of (blocks of) bytes, meaning that other than appending to a file, editing one requires rewriting the entire thing. – chepner Jun 17 '16 at 13:08
  • You might be able to make it faster in C++, where you can adjust buffer sizes. There is no way that you can do without the copying, though. – molbdnilo Jun 17 '16 at 13:22
  • http://stackoverflow.com/questions/2503254/unix-prepending-a-file-without-a-dummy-file – xxfelixxx Jun 17 '16 at 13:25
  • @xxfelixxx yeah, I've searched stackoverflow looking for answers, but these are all slow. It might come down to looking for the fastest way to open a file and do block copies, then lseek to 0 and write the header. – Steve Amerige Jun 17 '16 at 13:34
  • Maybe look into using `dd` – xxfelixxx Jun 17 '16 at 13:35
  • This question is too broad. It requires kernel-level work. Filesystems usually use tree structures which suggest the possibility of an efficient insertion of blocks at the beginning or anywhere else in the file. A file could be fitted with a value which indicates that its true start is offset from the start of the first block, allowing material to be inserted which is a fraction of a block long. – Kaz Jun 17 '16 at 13:35
  • I think `dd` command might make things a little faster compared to `cat` but as others have pointed out you may have to create a new file and then rename the new file to bigfile. – Fazlin Jun 17 '16 at 13:36
  • 1
    Another possibility is to write a "catenating file system" (perhaps as a FUSE module): a file system whose files are virtually composed as catenations of other files. Then to insert a header, we simply reconfigure a catenated file to include that header. – Kaz Jun 17 '16 at 13:37
  • 1
    Can one resize the file, use the C++ memmove, and then lseek to the beginning and write the header? I don't know how to make the question more specific than qualifying it to CentOS 7.2 and saying what I want to accomplish. This question has been asked elsewhere with fewer qualifications. I'm specifically interested in performance as measured by time to complete the operation. – Steve Amerige Jun 17 '16 at 13:41
  • @SteveAmerige: However you do this, the data must be read into memory and rewritten. The read alone will take 1.5 hours and there's nothing you can do to change that. Yes, a *memmove* would be faster, but the data has to be in memory first, so there is still a 90 minute wait beforehand, and that's if you're lucky enough to have a system with 500GB of RAM. Plus, moving it in memory doesn't help you relocate the on-disk information, so you'll have to wait *another* couple of hours to write it back to disk – Borodin Jun 17 '16 at 14:15
  • 1
    Possible option: Keep the lines in the files "backwards", so that the first one you want to read is last in the file. To prepend a line, you just need to append it to the file. Use File::ReadBackwards to read the file in the correct order. – ikegami Jun 17 '16 at 14:29
  • @Kaz "Another possibility is to write a "catenating file system" (perhaps as a FUSE module)" Why don't you post this as an answer? It will have at least one upvote (from me). – Leon Jun 17 '16 at 18:33
  • @SteveAmerige BTW, what data is stored in your file. Is it possible to reshuffle it? – Leon Jun 17 '16 at 18:38
  • Would it be possible to just write a separate header file? file1.blob and file1.header This assumes you're the one making use of the header file, of course. If you're sending it somewhere else that needs the header first, that's a different issue. – Altainia Jun 17 '16 at 18:47
  • @SteveAmerige Going off of ikegami's comment here -- why not just **append the header**, and then process such files accordingly. Not a systemic solution, but it is practical and it should do precisely what you need. You could first append a marker indicating that a header follows for a fairly reasonable solution to an ad-hoc and sudden problem. Processing of such files can be nicely merged with those that do have a header upfront. – zdim Jun 17 '16 at 23:28
  • @zdim: I get the idea that this is a one-off modification to a number of large files, and I'm inclined to advise that the OP should bite the bullet and get it done. 90 minutes per file (as long as the new file is located on a different drive) isn't so bad, and several such files could be updated overnight in a batch. The solutions below seem to be just the author flexing their IT muscles, and certainly `mmap` / `memmove` will be one of the *slowest* ways because it doubles the load on the disk drive. – Borodin Jun 18 '16 at 16:29
  • @Borodin I think that it's a fine solution to fix up a few files like this and be done with it. If the need comes up occasionally in real time, appending the header is nearly immediate and it is feasible to have code in place that can process such files along with others. I was going to post some code but the question got put on hold. – zdim Jun 18 '16 at 23:57
  • 1
    @Borodin Also, I must say that I didn't get the point of suggesting to "write" a custom filesystem or move the project to a "platform" that supports this or that, ignoring the clear restriction to _CentOS 7.2_ and the practical nature of the question. These discussions are cool but do not seem to address the problem at hand. I certainly don't see how `mmap` and `memmove` were going to help at all. If this were a systemic problem it would clearly be better to write those files differently to start with. – zdim Jun 19 '16 at 00:03
  • Other than tagging a bunch of specific languages for a question that isn't language-specific at all (but is rather OS/kernel-specific), this strikes me as a perfectly good question. – Charles Duffy Feb 06 '17 at 16:13

2 Answers


Is this something that needs to be done once, to "fix" a design oversight perhaps? Or is it something that you need to do on a regular basis, for instance to add summary data (for instance, the number of data records) to the beginning of the file?

If you need to do it just once then your best option is just to accept that a mistake has been made and take the consequences of the retro-fix. As long as you make your destination drive different from the source drive, you should be able to fix up a 500GB file within about two hours. So after a week of batch processes running after hours, you could have upgraded perhaps thirty or forty files.

If this is a standard requirement for all such files, and you think you can apply the change only when the file is complete -- some sort of summary information perhaps -- then you should reserve the space at the beginning of each file and leave it empty. Then it is a simple matter of seeking into the header region and overwriting it with the real data once it can be supplied.
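For illustration, here is a minimal sketch of that approach, assuming a 20KB region was reserved (e.g. zero-filled) at the front of the file when it was created; the file name, reserved size, and header contents are placeholders:

/* Overwrite a pre-reserved header region in place.
   Assumes the first HEADER_SIZE bytes were reserved when the file was
   created; only those bytes are touched, so the cost does not depend
   on the size of the file. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define HEADER_SIZE 20480                    /* hypothetical reserved size */

int main(void)
{
    char header[HEADER_SIZE] = {0};
    snprintf(header, sizeof header, "record-count: 123456789\n");  /* placeholder */

    int fd = open("bigfile", O_WRONLY);      /* placeholder name */
    if (fd < 0) { perror("open"); return 1; }

    /* pwrite() at offset 0 overwrites only the reserved region. */
    if (pwrite(fd, header, sizeof header, 0) != (ssize_t)sizeof header) {
        perror("pwrite");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}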

As has been explained, standard file systems require the whole of a file to be copied in order to add something at the beginning.

If your 500GB file is on a standard hard disk, which will allow data to be read at around 100MB per second, then reading the whole file will take 5,120 seconds, or roughly 1 hour 30 minutes.

As long as you arrange for the destination to be a separate drive from the source, you can mostly write the new file in parallel with the read, so it shouldn't take much longer than that. But there's no way to speed it up other than that, I'm afraid.
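For what it's worth, a compiled helper buys you little over cat here; the sketch below does essentially what cat header bigfile > tmp does, just with a large user-space buffer. The file names are placeholders, and the output file should live on a different drive:

/* Prepend a header by streaming header + bigfile into a new file using
   large blocks. Functionally equivalent to `cat header bigfile > newfile`. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUF_SIZE (8 * 1024 * 1024)           /* 8 MiB per read/write */

static int copy_fd(int in, int out, char *buf)
{
    ssize_t n;
    while ((n = read(in, buf, BUF_SIZE)) > 0) {
        ssize_t done = 0;
        while (done < n) {                   /* handle short writes */
            ssize_t w = write(out, buf + done, n - done);
            if (w < 0) return -1;
            done += w;
        }
    }
    return (n < 0) ? -1 : 0;
}

int main(void)
{
    char *buf = malloc(BUF_SIZE);
    if (!buf) { perror("malloc"); return 1; }

    int hdr = open("header", O_RDONLY);      /* placeholder names */
    int big = open("bigfile", O_RDONLY);
    int out = open("newfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (hdr < 0 || big < 0 || out < 0) { perror("open"); return 1; }

    if (copy_fd(hdr, out, buf) < 0 || copy_fd(big, out, buf) < 0) {
        perror("copy");
        return 1;
    }
    /* The new file then replaces the original (mv), as in the question. */
    close(hdr); close(big); close(out); free(buf);
    return 0;
}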

Borodin

If you were not bound to CentOS 7.2, your problem could be solved (with some reservations [1]) by fallocate, which provides the needed functionality for the ext4 filesystem starting from Linux 4.2 and for the XFS filesystem since Linux 4.1:

int fallocate(int fd, int mode, off_t offset, off_t len);

This is a nonportable, Linux-specific system call. For the portable, POSIX.1-specified method of ensuring that space is allocated for a file, see posix_fallocate(3).

fallocate() allows the caller to directly manipulate the allocated disk space for the file referred to by fd for the byte range starting at offset and continuing for len bytes.

The mode argument determines the operation to be performed on the given range. Details of the supported operations are given in the subsections below.

...

Increasing file space

Specifying the FALLOC_FL_INSERT_RANGE flag (available since Linux 4.1) in mode increases the file space by inserting a hole within the file size without overwriting any existing data. The hole will start at offset and continue for len bytes. When inserting the hole inside file, the contents of the file starting at offset will be shifted upward (i.e., to a higher file offset) by len bytes. Inserting a hole inside a file increases the file size by len bytes.

...

FALLOC_FL_INSERT_RANGE requires filesystem support. Filesystems that support this operation include XFS (since Linux 4.1) and ext4 (since Linux 4.2).


[1] fallocate allows prepending data to the file only at multiples of the filesystem block size. So it will solve your problem only if it's acceptable for you to pad the extra space with whitespace, comments, etc.
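On a kernel and filesystem that support it, a minimal sketch might look like the following. The file name and header contents are placeholders, the inserted range is rounded up to a whole number of filesystem blocks as noted above, and on older glibc you may also need <linux/falloc.h> for the flag definition:

/* Insert a block-aligned hole at the start of the file with
   FALLOC_FL_INSERT_RANGE, then write the (padded) header into it.
   Requires XFS on Linux >= 4.1 or ext4 on Linux >= 4.2. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("bigfile", O_RDWR);        /* placeholder name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Round the <20KB header up to whole blocks; INSERT_RANGE only accepts
       block-aligned offset and length. st_blksize is typically (though not
       guaranteed to be) the filesystem block size. */
    off_t blk  = st.st_blksize;
    off_t hole = ((20 * 1024 + blk - 1) / blk) * blk;

    if (fallocate(fd, FALLOC_FL_INSERT_RANGE, 0, hole) < 0) {
        perror("fallocate(FALLOC_FL_INSERT_RANGE)");
        return 1;
    }

    /* Fill the new space with the real header followed by padding. */
    char *buf = calloc(1, hole);
    if (!buf) { perror("calloc"); return 1; }
    strcpy(buf, "my header contents...\n");  /* placeholder header */
    if (pwrite(fd, buf, hole, 0) != hole) { perror("pwrite"); return 1; }

    free(buf);
    close(fd);
    return 0;
}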


Without support for fallocate()+FALLOC_FL_INSERT_RANGE, the best you can do is (a rough sketch follows the list):

  1. Increase the file (so that it has its final size);
  2. mmap() the file;
  3. memmove() the data;
  4. Fill the header data in the beginning.
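A rough sketch of that fallback, assuming a 64-bit system (so the whole file fits in the address space) and a header padded to a fixed size; names and sizes are placeholders. Note that this still shifts every byte of the file, so it is not faster than a plain copy:

/* Grow the file, map it, shift the data up by HDR bytes, write the header. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define HDR 20480                            /* hypothetical padded header size */

int main(void)
{
    int fd = open("bigfile", O_RDWR);        /* placeholder name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    off_t oldsize = st.st_size;

    /* 1. Grow the file to its final size. */
    if (ftruncate(fd, oldsize + HDR) < 0) { perror("ftruncate"); return 1; }

    /* 2. Map the enlarged file. */
    char *p = mmap(NULL, oldsize + HDR, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* 3. Shift the existing data up by HDR bytes (the slow part). */
    memmove(p + HDR, p, oldsize);

    /* 4. Write the header (padded with spaces) at the beginning. */
    const char *text = "my header contents...\n";   /* placeholder header */
    memset(p, ' ', HDR);
    memcpy(p, text, strlen(text));

    munmap(p, oldsize + HDR);
    close(fd);
    return 0;
}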
Leon
  • No matter how you read and write the file, it still has to be read into memory in its entirety and written out again; it makes no difference whether you use `memcpy` and `memmove` or `readline` and `print`. In fact this way will be much slower because you will be reading and writing on the same disk drive, whereas a `print` may be directed to a different hardware device. Even the use of `fallocate` with `FALLOC_FL_INSERT_RANGE` will take a couple of hours to allocate a few hundred bytes at the start of a 500GB file, because the data still has to be moved somehow. – Borodin Jun 17 '16 at 15:53
  • 1
    @Borodin File systems like ext4 use tree structures with pointers for allocating files (this is of course a big simplification). It is possible to insert blocks at the front without moving all the later blocks. – Kaz Jun 17 '16 at 19:58
  • Please explain why you're suggesting `mmap` and `memmove`, and why you think it's a better solution than `readline` and `print` in Perl – Borodin Jun 17 '16 at 21:25
  • @Borodin, ...well, the obvious answer to that question is that line-level operations are extra overhead even with buffering on top. Not that I'm convinced that `mmap()` + `memmove()` is the right thing, but I *am* convinced that `readline` + `print` is suboptimal; I don't see any reason to work at less than page-size chunks. – Charles Duffy Feb 06 '17 at 17:50
  • 1
    (heck, if you wanted to reduce the amount of temporary space, you could almost work backwards, allocating a sparse target file, reading large chunks from the end of the source to the destination and then truncating after successful `fdatasync()`; lots of various optimizations possible when one works with lower-level primitives). – Charles Duffy Feb 06 '17 at 17:52