
I have a 226GB log file, and I want to split it up into chunks to make it easier to xz. Problem is, I only have 177GB of free space to work with.

Is there a way to split a file in half or into N number of chunks without keeping an additional copy of the original?

    $ split myFile.txt
    $ ls -halF
    -rw-r--r--   1 user group 35 Dec 29 13:17 myFile.txt
    -rw-r--r--   1 user group 8 Dec 29 13:18 xaa
    -rw-r--r--   1 user group 3 Dec 29 13:18 xab
    -rw-r--r--   1 user group 5 Dec 29 13:18 xac
    -rw-r--r--   1 user group 10 Dec 29 13:18 xad
    -rw-r--r--   1 user group 8 Dec 29 13:18 xae
    -rw-r--r--   1 user group 1 Dec 29 13:18 xaf

I would rather end up with no myFile.txt left over, just the split files. I would gladly stick with the default behavior and delete the original afterwards, but I don't have the space to work that way.

I'm not an expert with sed or awk, but I thought maybe one of them could achieve some kind of "remove into another file" behavior?

  • Just copy the file to a different system and deal with it offline...and then make sure you have a decent, working log rotation strategy. – EEAA Dec 29 '14 at 19:48
  • So am I to interpret this as my query is not possible? And I feel the comment about having a decent working log rotation strategy is a little hind-sighted. Obviously, everyone wants an optimized working log rotation strategy; however, I am working with an anomaly and within unexpected constraints. Rather than saying "Oh well, should'a could'a would'a" and giving up, I am trying to work within the bounds I was given first. – Ceafin Dec 29 '14 at 19:54
  • I wouldn't bother trying to split it up at this point...just copy it offline and sort it out. It's not likely worth the time and IO contention it'll take to get it done on the source server. – EEAA Dec 29 '14 at 19:56
  • Honestly, I agree. I am just trying to see if I can do some tight gymnastics within this hole I was given before requesting more resources. – Ceafin Dec 29 '14 at 19:58
  • There appears to be a `truncate` command which can make a large file smaller by cutting off the end part. If you have copied that part to a target file beforehand you can do the cutting up part without using more extra space than a single target file requires. – Thorbjørn Ravn Andersen Dec 31 '14 at 00:00
  • This works: http://superuser.com/questions/177823/are-there-any-tools-in-linux-for-splitting-a-file-in-place – Oleg Mikheev Dec 31 '14 at 17:41

2 Answers


What might work is to stream parts of the file directly into xz. I'd guess a log file compresses well enough that the original plus the compressed parts will fit into the space you have left.

  1. Get the number of lines:

    wc -l myFile.txt
    
  2. Divide this into as many parts as you like, e.g. 10k lines per part.
  3. Use sed to pipe the part you want into xz:

    sed -n '1,10000p' myFile.txt | xz > outfile01.xz 
    sed -n '10001,20000p' myFile.txt | xz > outfile02.xz
    

and so on. This could of course be scripted; a rough sketch follows.
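For reference, here is a rough sketch of such a loop, assuming GNU coreutils, the 10,000-line chunk size from above, and the file name from the question. The extra `${END}q` command tells sed to quit once the chunk has been printed, so each pass stops reading at the end of its chunk instead of scanning the whole 226GB file:

    #!/bin/sh
    # Sketch only: compress myFile.txt in 10000-line chunks, piping each
    # chunk straight into xz so no uncompressed copy ever touches the disk.
    FILE=myFile.txt
    CHUNK=10000
    TOTAL=$(wc -l < "$FILE")
    START=1
    PART=1
    while [ "$START" -le "$TOTAL" ]; do
        END=$((START + CHUNK - 1))
        OUT=$(printf 'outfile%02d.xz' "$PART")
        # "${END}q" makes sed quit after printing the chunk instead of
        # reading the rest of the file.
        sed -n "${START},${END}p;${END}q" "$FILE" | xz > "$OUT"
        START=$((END + 1))
        PART=$((PART + 1))
    done

Even with the early quit, every pass still re-reads the file from the beginning, so on a 226GB file this will be slow. If your split is a reasonably recent GNU version, its `--filter` option (e.g. `split -l 10000 --filter='xz > $FILE.xz' myFile.txt`) does the same thing in a single pass.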

But honestly, do as EEAA said...

Sven
  • This was what I was thinking could work! And must try when the data isn't so damn important, ha! But I'm going to just pull it off the box, split, compress, and push back to the archive -via EEAA – Ceafin Dec 29 '14 at 21:04
  • I usually get around 20-to-1 compression with bzip2, so compress it first, then stream-decompress it and chunk from there if you need to. – Ronald Pottol Dec 29 '14 at 22:13

You could do successive invocations of tail and truncate to carve chunks off the end of the massive file.

Something like:

    tail -n 10000 myFile.txt > myFile.001.txt
    # redirect into wc so it prints only the byte count, not the file name
    truncate -s -$(wc -c < myFile.001.txt) myFile.txt
    xz myFile.001.txt   # xz removes myFile.001.txt once myFile.001.txt.xz is written

You could also script that. It'll probably take a while to run, though, and it'd be much better to just deal with it off-box.
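A minimal sketch of that loop, assuming GNU coreutils (for `truncate`) and the same 10,000-line chunk size as the other answer; the `myFile.%03d.txt` naming follows the example above. Note that the chunks come off the tail of the file, so part 001 is the end of the log:

    #!/bin/sh
    # Sketch only: repeatedly carve 10000-line chunks off the END of
    # myFile.txt, shrink the original with truncate, and compress each
    # chunk with xz. Chunks come out in reverse order.
    set -e                        # stop immediately if any step fails
    FILE=myFile.txt
    PART=1
    while [ -s "$FILE" ]; do
        OUT=$(printf 'myFile.%03d.txt' "$PART")
        tail -n 10000 "$FILE" > "$OUT"
        # Shrink the original by exactly the number of bytes just copied out.
        truncate -s -"$(wc -c < "$OUT")" "$FILE"
        xz "$OUT"                 # xz removes $OUT once the .xz is written
        PART=$((PART + 1))
    done

The peak extra space needed is one uncompressed chunk plus the compressed parts, which is the whole point. Keep in mind that once `truncate` has run, the carved-off lines exist only in the chunk file; xz only deletes its input after a successful compression, so don't remove chunk files by hand until their .xz is safely written.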

Michael Lowman