
I need to process a long text file, splitting it into many smaller files. I have a single-pass while - read - done <inputfile loop, and a matching line signals the start of a new output file. In the input file, each matching line is always preceded by a blank line.

My problem is that every output file except the final one ends with an extra newline character. I have recreated the problem in this short example.

#!/bin/zsh

rm inputfile outputfile1 outputfile2
IFS=''
printf "section1\nsection1end\n\nsection2\nsection2end\n" >inputfile

echo "  open outputfile1"
exec 3<> outputfile1
counter=1
IFS=$'\n'

while IFS= read line; do

    if [[ "$line" == "section2" ]]; then
        echo "  Matched start of section2. Close outputfile1 and open outputfile2"
        exec 3>&-
        exec 3<> outputfile2
    fi
    echo "$line" >&3
    echo $counter $line
    let "counter = $counter + 1"
done <inputfile
echo "  Close outputfile2"
exec 3>&-

echo
unset IFS
echo `wc -l inputfile`
echo `wc -l outputfile1`
echo `wc -l outputfile2`
echo "  The above should show 5, 2, 2 as desired number of newlines in these files."

Which outputs:

  open outputfile1
1 section1
2 section1end
3
  Matched start of section2. Close outputfile1 and open outputfile2
4 section2
5 section2end
  Close outputfile2

5 inputfile
3 outputfile1
2 outputfile2
  The above should show 5, 2, 2 as desired number of newlines in these files.
kometen
SteveGroom
  • Will the command line utility split be able to perform what you want? – kometen Sep 05 '21 at 09:56
  • My actual code has a series of extended regexes to detect the section changes - I don’t think I can get split to work with such patterns. – SteveGroom Sep 05 '21 at 10:32
  • @kometen - Thinking more about it, I tried to split the file with an ERE, and would then have to move and rename the resulting files. The regex matches lines that occur after a blank line: `split -p "^[## Foreword|## [0-9]+\.|## Appendix [0-9]+|### [0-9]+.[0-9]+\.]" draft6.md` It produced 370 files and all but the last one have two blank lines at the end! Oh well. – SteveGroom Sep 05 '21 at 12:25
  • You may want to re-title this. If I am following correctly, the shell script isn't adding unwanted newlines, it is *retaining* unwanted newlines. The newlines were present in the original input. – Gairfowl Sep 05 '21 at 13:38
  • I added some `zsh` options in an answer below, but you might have better luck doing this with an `awk` script, e.g. like example 5 here: https://www.theunixschool.com/2012/06/awk-10-examples-to-split-file-into.html – Gairfowl Sep 05 '21 at 14:24
  • @Gairfowl the single blank lines in the input file should be retained, but after the script runs there are double blank lines. – SteveGroom Sep 05 '21 at 15:30

1 Answer


Option 1

Get rid of all empty lines. This only works if you don't need to retain any of the empty lines in the middle of a section. Change:

    echo "$line" >&3

To:

    [[ -n "$line" ]] && echo "$line" >&3

Option 2

Rewrite each file using command substitution, which strips all trailing newlines; `echo` then adds a single newline back. Works best with short files, since the whole file is read into a variable. Change:

        exec 3>&-
        exec 3<> outputfile2

To:

        exec 3>&-
        data=$(<outputfile1)
        echo "$data" >outputfile1
        exec 3<> outputfile2
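
To see the trimming in isolation, here is a minimal sketch. The file name `demo.txt` and the helper name `trim_trailing_blank_lines` are made up for illustration. It uses `$(cat file)`, which trims trailing newlines the same way as the `$(<file)` shorthand above, and swaps `echo` for `printf '%s\n'`, because zsh's `echo` interprets backslash escapes by default and that could alter text such as a markdown document.

```shell
# Hypothetical helper: rewrite a file so it ends with exactly one newline.
trim_trailing_blank_lines() {
    # Command substitution strips ALL trailing newlines from the file content.
    data=$(cat "$1")
    # printf writes the content back followed by a single newline.
    printf '%s\n' "$data" > "$1"
}

# Demo file (assumed name): two lines of text plus one trailing blank line.
printf 'section1\nsection1end\n\n' > demo.txt
trim_trailing_blank_lines demo.txt
wc -l < demo.txt
```

Note that the whole file passes through a shell variable, so this is probably best kept to modestly sized text files.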

Option 3

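Have the loop write the line from the prior iteration, and skip writing that prior line (the blank line that ended the previous file) when you start a new file. This relies on every matched line being preceded by a blank line, as described in the question: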

#!/bin/zsh

rm inputfile outputfile1 outputfile2
IFS=''
printf "section1\nsection1end\n\nsection2\nsection2end\n" >inputfile

echo "  open outputfile1"
exec 3<> outputfile1
counter=1
IFS=$'\n'

priorLine=MARKER
while IFS= read line; do
    if [[ "$line" == "section2" ]]; then
        echo "  Matched start of section2. Close outputfile1 and open outputfile2"
        exec 3>&-
        exec 3<> outputfile2
    elif [[ "$priorLine" != MARKER ]]; then
        echo "$priorLine" >&3
    fi
    echo $counter $line
    let "counter = $counter + 1"
    priorLine="$line"
done <inputfile
echo "$priorLine" >&3
echo "  Close outputfile2"
exec 3>&-

echo
unset IFS
echo `wc -l inputfile`
echo `wc -l outputfile1`
echo `wc -l outputfile2`
echo "  The above should show 5, 2, 2 as desired number of newlines in these files."
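
If remembering the whole prior line feels heavy, a possible variation is to hold back only blank lines: count them as they arrive, write them out when ordinary text follows, and discard them when a section start follows. This is only a sketch under assumptions. The function name `split_sections` is hypothetical, the `section2` test stands in for the real section-start patterns, and it appends with `>>` instead of keeping file descriptor 3 open. Blank lines at the very end of the input are dropped as well.

```shell
# Hypothetical helper: split "$1" into outputfile1, outputfile2, ...
split_sections() {
    n=1
    pending=0                    # blank lines seen but not yet written
    : > "outputfile$n"           # create or truncate the first output file
    while IFS= read -r line; do
        if [ -z "$line" ]; then
            pending=$((pending + 1))   # hold the blank line back for now
            continue
        fi
        if [ "$line" = "section2" ]; then   # stand-in for the real match
            n=$((n + 1))
            : > "outputfile$n"
            pending=0            # discard blank lines that preceded the match
        fi
        # Write any held-back blank lines that belong inside this section.
        while [ "$pending" -gt 0 ]; do
            printf '\n' >> "outputfile$n"
            pending=$((pending - 1))
        done
        printf '%s\n' "$line" >> "outputfile$n"
    done < "$1"
}

# Same sample input as in the question.
printf 'section1\nsection1end\n\nsection2\nsection2end\n' > inputfile
split_sections inputfile
wc -l outputfile1 outputfile2
```

Blank lines in the middle of a section are kept, because they are written as soon as a non-matching line follows them.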
Gairfowl
  • Thank you @Gairfowl - option two works for me. The input file has plenty of required blank lines (it's for splitting a massive markdown doc into chapters), so I won't use option 1. Option three was what I was trying to avoid - tracking prior line state is not such a clean solution. **Option 2 - great** - simply rewrite the file after it's closed to get rid of the spurious line. – SteveGroom Sep 05 '21 at 15:36