
I have a set of large files that have to be split into 100MB parts. The problem I am running into is that the lines are terminated by the ASCII ^B (\u0002) character.

Thus, I need to be able to get 100MB parts (plus or minus a few bytes, obviously) that also account for these line endings.

Example file:

000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B

The size of a "line" can vary.

I know of split and csplit, but couldn't wrap my head around combining the two.

#!/bin/bash
split -b 100m filename                              #splitting by size
csplit filename “/$(echo -e “\u002”)/+1” “{*}”      #splitting by context

Any suggestions on how I can do 100MB chunks that keep the lines intact? As a side note, I am not able to change the line endings to \n, because that would corrupt the file: the data between ^B characters has to keep any newline characters that are present.

Vlad
  • `$'\x02'` is a much more reliable way to emit that character as a literal. And note that "smart quotes" in your code (as opposed to literal ASCII quotes) will cause no end of problems. – Charles Duffy Feb 12 '18 at 21:39
  • That is to say, to be meaningful in shell, code must use `"`, not `“` or `”`. – Charles Duffy Feb 12 '18 at 21:41
  • If you are using GNU `split`, then `split -C 100m -t$'\x02'` will give you files of at most 100MB each, although a file could be much smaller if a really long "line" straddles the boundary. – chepner Feb 12 '18 at 21:44
  • I just saw that. It must have changed when I copy-pasted in notepad. I didn't know about `$'\x02'`, thanks! – Vlad Feb 12 '18 at 21:45
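
For reference, a minimal sketch of the GNU `split` route from chepner's comment might look like this (assuming a reasonably recent GNU coreutils, where `-C`/`--line-bytes` caps each output file's size and `-t`/`--separator` sets the record separator; the `out.` prefix is just an example):

# GNU split only: at most 100MB of whole \x02-terminated records per output file
split -C 100m -t $'\x02' filename out.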

1 Answer


The following will implement your splitting logic in native bash -- not very fast to execute, but it'll work anywhere bash can be installed without needing 3rd-party tools to run:

#!/bin/bash

prefix=${1:-"out."}                        # first optional argument: output file prefix
max_size=${2:-$(( 1024 * 1024 * 100 ))}    # 2nd optional argument: size in bytes

cur_size=0                                 # running count: size of current chunk
file_num=1                                 # current numeric suffix; starting at 1
exec >"$prefix$file_num"                   # open first output file

while IFS= read -r -d $'\x02' piece; do    # as long as there's new input...
  printf '%s\x02' "$piece"                 # write it to our current output file      
  cur_size=$(( cur_size + ${#piece} + 1 )) # add its length to our counter
  if (( cur_size > max_size )); then       # if our counter is over our maximum size...
    (( ++file_num ))                       # increment the file counter
    exec >"$prefix$file_num"               # open a new output file
    cur_size=0                             # and reset the output size counter
  fi
done

if [[ $piece ]]; then  # if the end of input had content without a \x02 after it...
  printf '%s' "$piece" # ...write that trailing content to our output file.
fi
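
For example, assuming the script above is saved as `split_on_stx.sh` (a name chosen here just for illustration), it reads its input on stdin and takes the prefix and chunk size as its two optional arguments:

# hypothetical invocation: read bigfile on stdin, write ~100MB chunks named part.1, part.2, ...
bash split_on_stx.sh part. "$(( 100 * 1024 * 1024 ))" < bigfile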

Here's a version that relies on dd (the GNU version here, though it could be changed to be portable), which should be much faster with large inputs:

#!/bin/bash

prefix=${1:-"out."}                        # first optional argument: output file prefix

file_num=1                                 # current numeric suffix; starting at 1
exec >"$prefix$file_num"                   # open first output file

while true; do
  dd bs=1M count=100 iflag=fullblock status=none   # copy a full 100MB from stdin to stdout (GNU dd; fullblock avoids short reads from a pipe)
  if IFS= read -r -d $'\x02' piece; then   # read in bash to the next boundary
    printf '%s\x02' "$piece"               # write that segment to stdout
    exec >"$prefix$((++file_num))"         # re-open stdout to point to the next file
  else
    [[ $piece ]] && printf '%s' "$piece"   # write what's left after the last boundary
    break                                  # and stop
  fi
done

# if our last file is empty, delete it.
[[ -s $prefix$file_num ]] || rm -f -- "$prefix$file_num"
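
As a quick sanity check of the splitting logic (hypothetical file names again; this exercises the pure-bash version above, since its chunk size is an argument rather than fixed by the `dd` invocation), you can generate some \x02-delimited test data, split it with a tiny chunk size, and confirm the pieces reassemble to the original:

# smoke test: three \x02-terminated records, split into ~16-byte chunks
printf 'line one\nstill line one\x02line two\x02line three\x02' > testdata
bash split_on_stx.sh piece. 16 < testdata
cat piece.* | cmp - testdata && echo "pieces reassemble to the original"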
Charles Duffy
  • For all of bash's perceived slowness, there isn't really anything here that would be a problem. I may have to dabble and write a counterpart in C and time them. I suspect they would be close unless there is a significant difference in the default read buffer size. – David C. Rankin Feb 12 '18 at 21:56
  • @DavidC.Rankin, the performance of `read` is exactly the likely performance problem here -- the bash implementation goes a byte at a time, to avoid consuming more bytes than intended. That's useful at times -- if you want to, say, read input up to a given character and then hand the rest off to an external program -- but not so great from a performance perspective. – Charles Duffy Feb 12 '18 at 21:58
  • Ahh, then that could make for some fun cursor viewing over the course of a gigabyte or two -- thanks for the insight. – David C. Rankin Feb 12 '18 at 22:00
  • What we *could* do here (to take advantage of that behavior) is to, say, tell `dd` to copy 100mb from stdin to stdout, and then switch over to `read` to proceed from that boundary up to the next `\x02`. That would be a nice combination of speedy and requirement-meeting. – Charles Duffy Feb 12 '18 at 22:00
  • Yes, I'm still wrapping my head around that one, but I guess `dd` 100M scan forward until the next `\x02`, update file redirection and keep going and `dd` another 100M when we run out of bytes. (still not completely wrapped around all the details, but I like that approach) – David C. Rankin Feb 12 '18 at 22:07
  • @CharlesDuffy: Interesting idea. Would [parallel](https://www.gnu.org/software/parallel/) be an option as well? – l'L'l Feb 12 '18 at 22:08
  • Added the suggested implementation. @l'L'l, having once made the mistake of looking at its source, I detest GNU parallel; I'll leave it to someone else (maybe you?) to think about how to build an answer around it. – Charles Duffy Feb 12 '18 at 22:14
  • (to be fair, it's been decades since I saw a piece of perl I thought was tolerable, and my current self questions my late-90s self's judgment). – Charles Duffy Feb 12 '18 at 22:22
  • The script works; however, it's quite slow. It does answer my question, as I never mentioned speed being a requirement. Thanks! – Vlad Feb 12 '18 at 22:55
  • @Vlad, ...I'd expect the `dd` version towards the bottom to be much more acceptable re: performance. – Charles Duffy Feb 13 '18 at 00:18
  • Would putting an `export LC_ALL=C` in there somewhere ever avoid trouble? – jthill Feb 13 '18 at 17:37
  • Hmm. `LC_CTYPE` is probably a bigger risk than anything else in that category -- but the real concern is whether NULs are allowed in the data; in that case, we can't safely `read` it into a shell variable (which are C strings, and thus NUL-delimited by nature) at all. And to be clear, I wouldn't expect `LC_CTYPE` to be *that* much of a problem; `read`'s behavior is binary-safe apart from NULs, and `dd` similarly operates on bytes rather than characters in its default mode of operation. – Charles Duffy Feb 13 '18 at 17:47
  • If you had a shell where `${#str}` was a character-counting rather than byte-counting operation, I suppose one could get caught. – Charles Duffy Feb 13 '18 at 17:50