I am trying to split a dozen 100MB+ CSV files into manageable smaller files for a curl POST.
I have managed to do it, but with a lot of temporary files and disk I/O. It's taking an eternity.
I am hoping someone can show me a way to do this much more effectively, preferably with little to no disk I/O.
#!/bin/sh
for csv in *.csv; do
    # drop the header row, then split the rest into chunks (10000 lines here, adjust to taste)
    tail -n +2 "$csv" | split -a 5 -l 10000 - "$RANDOM.split."
done
# choose a CSV at random and save its header line to a file named "header"
head -n 1 "$(ls *.csv | sort -R | tail -1)" > header
mkdir split
for x in $(/usr/bin/find . -maxdepth 1 -type f -name '*.split.*'); do
    echo "Processing $x"
    # prepend the saved header to each chunk and write it into split/
    cat header "$x" > "split/$(basename "$x")"
    rm -f "$x"
done
The above script may not work exactly as written; I basically pieced it together from a combination of these commands.
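For reference, here is a sketch of the kind of single-pass approach I'm hoping for. It assumes GNU coreutils split, whose --filter option pipes each chunk through a command instead of writing it directly, so the header can be prepended without any intermediate files (untested):
#!/bin/sh
# Sketch, assuming GNU coreutils split with --filter support.
# Each chunk is piped through the filter, which prepends the saved header
# and writes straight into split/ -- no intermediate *.split.* files on disk.
mkdir -p split
for csv in *.csv; do
    HDR=$(head -n 1 "$csv")
    export HDR
    tail -n +2 "$csv" | split -a 5 -l 10000 \
        --filter='{ printf "%s\n" "$HDR"; cat; } > "split/$FILE"' \
        - "$(basename "$csv" .csv).split."
done
Using the source file's basename as the prefix also keeps chunk names from colliding between input files.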
I decided to make the curl POST a separate step entirely, in case of upload failure; I didn't want to lose the data if not all of it got posted. But if, say, on error from curl the data could be moved into a redo folder, that would work too (a rough sketch of that follows the script below).
#!/bin/sh
# working on a progress indicator as a percentage -- never finished
count=$(ls -1 2> /dev/null | wc -l | cut -d' ' -f1)
for file in $(/usr/bin/find . -maxdepth 1 -type f); do
    echo "Processing $file"
    curl -XPOST --data-binary @"$file" -H "Content-Type: text/cms+csv" "$1"
done
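A rough sketch of what I had in mind, with the redo folder and the percentage progress indicator I never got around to (untested; the split/ and redo/ directory names are just examples):
#!/bin/sh
# Sketch: post each chunk, show a rough percentage, and move failures to redo/
# instead of losing them. Assumes the chunks live in split/ and the target URL
# is passed as $1.
url=$1
mkdir -p redo
total=$(find split -maxdepth 1 -type f | wc -l)
n=0
for file in split/*; do
    [ -f "$file" ] || continue
    n=$((n + 1))
    echo "[$((n * 100 / total))%] Processing $file"
    # --fail makes curl exit non-zero on HTTP errors, so failures are caught
    if curl --fail -XPOST --data-binary @"$file" \
            -H "Content-Type: text/cms+csv" "$url"; then
        rm -f "$file"
    else
        mv "$file" redo/
    fi
done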
Edit 1 -- why the $RANDOM? Because split will produce the exact same file names when it splits the next file as it did for the first, so .. aa ab ac ... would be produced for every input file and the chunks from one CSV would overwrite the chunks from the previous one. I need to ensure every file produced by split has a unique name for the entire run.
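That said, a per-file counter (or the source file's name) as the prefix would guarantee uniqueness without relying on $RANDOM, which could in principle repeat; a minimal sketch:
#!/bin/sh
# Sketch: a per-file counter instead of $RANDOM, so the split prefixes are
# guaranteed unique for the whole run.
n=0
for csv in *.csv; do
    n=$((n + 1))
    tail -n +2 "$csv" | split -a 5 -l 10000 - "$n.split."
done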