I am trying to take a very large text file (over a million lines) that I created in Perl and run it through another Perl script that does roughly the following (note: the sketch below is shell, not Perl):

a=0
b=1
while read line;
do
    echo -n "" > "Write file"${b}
    a=($a + 1)
    while ( $a <= 5000)
    do
        echo $line >> "Write file"${b}
        a=($a + 1)
    done
    a=0
    b=($b + 1)
done < "read file"

I'm trying to size it down to 5,000 lines per file, incrementing the file name each time (filename1.txt, filename2.txt, filename3.txt, etc.).
This doesn't seem to work in shell, possibly due to the size of the input file, and for the life of me I can't think of how to change which file I am writing to in the middle of the loop.

Logan Smith

2 Answers

You can just do this in the shell using split.

For example:

split -l 5000 filename.txt filename.txt.

will split filename.txt into multiple files with a maximum of 5,000 lines each. The output files will be named filename.txt.aa, filename.txt.ab, filename.txt.ac, etc.
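To see the naming scheme on a small scale, here is a minimal sketch. It assumes a scratch directory and a hypothetical 12-line input called demo.txt (neither is from the question), and it uses 5 lines per piece instead of 5,000 so the result is easy to inspect.

```shell
# Work in a throwaway directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample input: the numbers 1..12, one per line.
awk 'BEGIN { for (i = 1; i <= 12; i++) print i }' > demo.txt

# At most 5 lines per piece; "demo.txt." is the prefix for the output names.
split -l 5 demo.txt demo.txt.

# Show how many lines ended up in each piece.
wc -l demo.txt.a?
```

This should leave three pieces, demo.txt.aa and demo.txt.ab with 5 lines each and demo.txt.ac with the remaining 2.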

From my man split:

NAME
     split -- split a file into pieces

SYNOPSIS
     split [-a suffix_length] [-b byte_count[k|m]] [-l line_count] [-p pattern] [file [name]]

DESCRIPTION
     The split utility reads the given file and breaks it up into files of 1000 lines each.  If file is a single dash (`-') or absent, split reads from the
     standard input.

     The options are as follows:

     -a suffix_length
             Use suffix_length letters to form the suffix of the file name.

     -b byte_count[k|m]
             Create smaller files byte_count bytes in length.  If ``k'' is appended to the number, the file is split into byte_count kilobyte pieces.  If ``m'' is
             appended to the number, the file is split into byte_count megabyte pieces.

     -l line_count
             Create smaller files n lines in length.

     -p pattern
             The file is split whenever an input line matches pattern, which is interpreted as an extended regular expression.  The matching line will be the
             first line of the next output file.  This option is incompatible with the -b and -l options.

     If additional arguments are specified, the first is used as the name of the input file which is to be split.  If a second additional argument is specified,
     it is used as a prefix for the names of the files into which the file is split.  In this case, each file into which the file is split is named by the prefix
     followed by a lexically ordered suffix using suffix_length characters in the range ``a-z''.  If -a is not specified, two letters are used as the suffix.

     If the name argument is not specified, the file is split into lexically ordered files named with the prefix ``x'' and with suffixes as above.
Mike Covington

As an aside, this is your fixed script:

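#!/bin/sh
# Split in-file into out-file-1, out-file-2, ... with a fixed number of lines each.
a=0    # lines written so far to the current output file
b=1    # index of the current output file
while IFS= read -r line; do
    # Starting a new chunk: create (or truncate) its output file.
    if [ "$a" -eq 0 ]; then
        : > "out-file-${b}"
    fi

    # Quoting "$line" keeps whitespace and backslashes intact.
    printf '%s\n' "$line" >> "out-file-${b}"

    a=$(( $a + 1 ))
    # 10 lines per file here; use 5000 for the case in the question.
    if [ "$a" -eq 10 ]; then
        a=0
        b=$(( $b + 1 ))
    fi
done < in-file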

Tested with bash and dash.

ikegami
  • I'm surprised this was accepted as an answer. Using `split` would be far better. I would have posted this as a comment if I could have, since it was meant to help you in future endeavors, not this particular case. – ikegami Nov 25 '15 at 19:23
  • Because split will be far too messy for file naming when we are talking about hundreds of files resulting from a single "split". I needed an incremental counter to keep things neat. – Logan Smith Dec 02 '15 at 17:51
  • How is that different than what `split` does? – ikegami Dec 02 '15 at 17:57
  • split will have it be file.txt.aa, file.txt.ab, and so on. For file transfers and automatic reading into a different team's unix box and tables, that would not fly. At least not for the team I'm making this for. They wanted file0001.txt, file0002.txt, and so on, so that's why split wouldn't work for me; otherwise I would have used it in the first place. – Logan Smith Dec 02 '15 at 20:47
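For the numbered names asked for in the last comment (file0001.txt, file0002.txt, and so on), one option is a short awk program instead of a shell read loop. This is only a sketch under assumptions: the function name split_numbered, the sample file demo.txt, and the 5-line chunk size are made up for illustration, and the four-digit zero padding is taken from the names given in the comment.

```shell
# split_numbered INPUT PREFIX LINES
# Hypothetical helper: writes PREFIX0001.txt, PREFIX0002.txt, ...
# with at most LINES lines each.
split_numbered() {
    awk -v prefix="$2" -v n="$3" '
        # First line of every chunk: close the previous file, pick the next name.
        (NR - 1) % n == 0 {
            if (out != "") close(out)
            out = sprintf("%s%04d.txt", prefix, ++count)
        }
        # Every line goes to the current output file.
        { print > out }
    ' "$1"
}

# Small demonstration in a scratch directory (use 5000 for the real file).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
awk 'BEGIN { for (i = 1; i <= 12; i++) print i }' > demo.txt
split_numbered demo.txt file 5
ls file*.txt
```

Closing each output file before opening the next keeps the number of open files at one, which matters when a million-line input produces hundreds of pieces.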