
I am trying to split a dozen 100MB+ CSV files into manageable smaller files for a curl POST.

I have managed to do it, but with a lot of temporary files and disk I/O. It's taking an eternity.

I am hoping someone can show me a way to do this much more effectively, preferably with little to no disk I/O:

#!/bin/sh

for csv in *.csv; do
    # drop the header, then split the rest one line per chunk;
    # $RANDOM keeps the chunk names from different files distinct
    tail -n +2 "$csv" | split -a 5 -l1 - "$RANDOM.split."
done

# choose a file at random to fetch the header line from
header=$(head -n 1 "$(ls *.csv | sort -R | tail -1)")

mkdir split

for x in $(/usr/bin/find . -maxdepth 1 -type f -name '*.split.*'); do
    echo "Processing $x"
    # prepend the saved header line to each chunk
    { echo "$header"; cat "$x"; } > "split/$x"
    rm -f "$x"
done

The above script may not work exactly as written; I basically pieced it together from a combination of these commands.

I decided to make the curl POST a separate step entirely, in case of upload failure; I didn't want to lose data if only part of it had been posted. But if, say, on a curl error the failed data could be moved into a redo folder, that would work.

#!/bin/sh

# working on a progress indicator as a percentage; never finished
count=$(ls -1 2> /dev/null | wc -l)

for file in $(/usr/bin/find . -maxdepth 1 -type f); do
    echo "Processing $file"
    curl -XPOST --data-binary @"$file" -H "Content-Type: text/cms+csv" "$1"
done
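
For the redo-folder idea mentioned above, here is a rough sketch of how the posting loop could keep failed chunks for a retry; the redo/ directory name and the use of curl's --fail flag are my additions, not part of the original script:

#!/bin/sh
# sketch only: move chunks that fail to POST into redo/ for a later retry
mkdir -p redo

for file in $(/usr/bin/find . -maxdepth 1 -type f -name '*.split.*'); do
    echo "Processing $file"
    # --fail makes curl exit non-zero on HTTP errors
    if curl --fail -XPOST --data-binary @"$file" -H "Content-Type: text/cms+csv" "$1"; then
        rm -f "$file"      # uploaded, safe to remove
    else
        mv "$file" redo/   # keep the chunk so it can be retried later
    fi
done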

Edit 1 -- Why the $RANDOM? Because split produces the same suffix sequence (aa, ab, ac, ...) every time it runs, so the chunks from the second file would overwrite those from the first. I need every file produced by split to have a unique name for the entire run.
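
For illustration, an alternative to $RANDOM (just a sketch, not what I actually ran) would be to derive the split prefix from each source file's name, which is unique by definition:

for csv in *.csv; do
    # "${csv%.csv}" strips the extension, so chunks are named e.g. data.split.aaaaa
    tail -n +2 "$csv" | split -a 5 -l1 - "${csv%.csv}.split."
done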

Christian Bongiorno
  • 5,150
  • 3
  • 38
  • 76
  • you're always going to have I/O writing a new version of a file from an existing file. Making that process as efficient as possible should be your focus. As is, there is too much "other" stuff in your question. (Why $RANDOM? Is that really a requirement of your solution, or are you experimenting? Other things aren't clear either.) Maybe include an ASCII-art sample of the input file structure followed by the output files expected from those inputs. (Just a small sample set.) Good luck. – shellter Nov 03 '14 at 05:14
  • split will always produce the same file names when splitting. I need the random prefix to ensure every split run produces unique files – Christian Bongiorno Nov 03 '14 at 05:19
  • It's late for me, I can't picture what you're trying to do from your verbal description, but I'm sure others will pile on shortly with workable solutions. Good luck! – shellter Nov 03 '14 at 05:21
  • This question has been cross-posted at Unix.SE: http://unix.stackexchange.com/questions/165632/splitting-a-csv-and-keeping-the-header-without-intermediate-files – John1024 Nov 03 '14 at 05:31

1 Answer


I'm not quite sure what you want to accomplish, but it seems to me that you are processing line by line. If you stream the CSV files and handle each line as you read it, you can do this without the intermediate disk I/O. From your description I can't tell whether this script runs as many instances or just one (multiple processes or one process), so I will do my best to mimic your script and produce results as similar as possible while removing the disk I/O problem. The code is provided below; please correct any script errors, as I have no way to run, debug, or verify it:

for csv in $(ls *.csv | sort -R); do
    # the first read discards the header line (like your tail -n +2)
    (read line
     count=0
     while read line; do
         Processing "$line"
         count=$(($count + 1))
         echo "$csv.$count" >> split/$count
     done
    ) < "$csv"
done

Your 'Processing' code should now work on a single line rather than on a file. Perhaps a pipe, with Processing reading from STDIN, will do the trick:

echo "$line" | Processing

Your curl can work in a similar way, reading from STDIN: replace @$file with @-, then print whatever you want curl to send and pipe it into curl, similar to this:

ProcessingAndPrint | curl -XPOST --data-binary @- -H "Content-Type: text/cms+csv" "$1"
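
Putting the pieces together, an untested sketch of the whole pipeline, sending each data line with the header prepended and writing no intermediate files (here $1 is the target URL, as in your script):

for csv in *.csv; do
    header=$(head -n 1 "$csv")
    tail -n +2 "$csv" | while read -r line; do
        # prepend the header and stream both lines straight into curl
        printf '%s\n%s\n' "$header" "$line" | curl -XPOST --data-binary @- -H "Content-Type: text/cms+csv" "$1"
    done
done
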
Robin Hsu
  • I am not seeing where $csv is being split. – Christian Bongiorno Nov 04 '14 at 22:35
  • In your split you use -l1, which splits just one line into each output file, i.e. a line-by-line split. Equivalently, I use a while loop + read + I/O redirection ("<") to read line by line. It's on-the-fly reading plus processing. – Robin Hsu Nov 05 '14 at 01:43