
I have a job that successfully produces a sequential file (CSV) output with around a hundred million rows. Can someone provide an example where the output is written to a hundred separate sequential files, each with a million rows?

What does the Sequential File stage look like, and how is it configured?

This is ultimately to allow QA to review any one of the individual outputs without needing a special text editor that can handle large text files.

  • If you get really desperate, you can make a custom step that pipes the results into the `split` Unix command. Something like `split -l=1000000` should work for your situation. – Mr. Llama Jun 17 '14 at 14:29

1 Answer


Based on the suggestion from @Mr. Llama, and with no other solutions forthcoming, we decided on a simple script to be executed at the end of the scheduled DataStage event.

#!/bin/bash
# usage:
# sh ./[script] [input]

# check for input:
if [ "$#" -ne 1 ]; then
  echo "No input file provided."
  exit 1
fi

# directory for output:
mkdir -p split

# header without content:
head -n 1 "$1" > header.csv

# content without header:
tail -n +2 "$1" > content.csv

# split the content into files of 100,000 records each:
split -l 100000 content.csv split/data_

# loop through the new split files, adding the header
# and a '.csv' extension:
for f in split/*; do cat header.csv "$f" > "$f.csv"; rm "$f"; done

# remove the temporary files:
rm header.csv
rm content.csv

Crude, but it works for us in this case.
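
For reference, if GNU coreutils is available, the same idea can be collapsed into a single pass with split's --filter option, which pipes each chunk through a small shell command and exposes the chunk's output file name as $FILE. The sketch below is only an assumption-laden variant, not what we run in production; the chunk size, output directory, and file-name prefix are illustrative.

#!/bin/bash
# One-pass variant: prepend the header while split writes each chunk,
# so no temporary content.csv or rename loop is needed.
# Assumes GNU coreutils (--additional-suffix needs 8.16+).

in="$1"

mkdir -p split
head -n 1 "$in" > header.csv      # capture the header row once

# -d                  : numeric suffixes (data_00.csv, data_01.csv, ...)
# --additional-suffix : give every chunk a .csv extension up front
# --filter            : each chunk is piped to this command; split sets
#                       $FILE to the chunk's output file name
tail -n +2 "$in" | split -l 1000000 -d --additional-suffix=.csv \
  --filter='cat header.csv - > "$FILE"' - split/data_

rm header.csv

The practical difference is that the file body is never written out twice, which matters at a few hundred million rows.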
