
I’m looking to build a process that triggers a DataStage Sequencer job whenever a file arrives in the server’s landing zone. CA7 is the scheduler, and the file naming conventions come in many different flavors, including the file extensions. Some file names also contain a date timestamp. I’m new to this activity, so please bear with me if I ask silly follow-on questions.

Thanks in advance for any help.

tbtcust

2 Answers


Check out the Wait For File stage in the Sequence.

It has options to wait for a file to appear (or disappear) and a time limit before it times out. So you do have to start the job at a certain time, but the processing will only start once the file appears. The stage expects a filename, though - but you could run an ls or similar command to get the filename and pass it as a parameter to your job.
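The ls-to-parameter idea above can be sketched in shell. This is a minimal sketch, not a tested DataStage integration: the landing-zone path and the project/sequence names are placeholder assumptions, and the actual dsjob invocation is left commented out.

```shell
#!/bin/sh
# Sketch: pick the oldest file in the landing zone and hand its name to the
# sequence as a parameter. LANDING_ZONE is an assumed path.
LANDING_ZONE="${LANDING_ZONE:-/data/landing}"

# Oldest file first (ls -t sorts newest first, -r reverses), so files are
# handled in arrival order; prints nothing if the folder is empty.
next_file() {
    ls -tr "$LANDING_ZONE" 2>/dev/null | head -n 1
}

FILE=$(next_file)
if [ -n "$FILE" ]; then
    # dsjob is the DataStage CLI; project and sequence names here are
    # placeholders, and SourceFile is an assumed job parameter:
    # dsjob -run -param SourceFile="$FILE" MyProject SeqProcessFile
    echo "would run sequence for: $FILE"
fi
```

The script itself could be the thing CA7 schedules, with the Wait For File stage inside the sequence acting as a second safety net.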

MichaelTiefenbacher
  • Hello MichaelTiefenbacher. Thanks so much. - Can this be a continuous process that starts when the server comes up? - What happens if multiple files land at the same time? – tbtcust Jan 04 '21 at 09:30
  • For the server start you might create a script that starts the job/sequence (check out the dsjob command). Multiple files at the same time will require a multi-instance job or a loop. – MichaelTiefenbacher Jan 04 '21 at 11:06
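The multiple-files case from the comment above can be sketched as a shell loop that starts one run per file. Again a hedged sketch: the landing-zone path, project name, and job name are assumptions, and the dsjob call is commented out so the loop itself is shown without pretending to be a working invocation.

```shell
#!/bin/sh
# Sketch: when several files land at once, queue one (multi-instance)
# job run per file. LANDING_ZONE is an assumed path.
LANDING_ZONE="${LANDING_ZONE:-/data/landing}"

process_all() {
    n=0
    for f in "$LANDING_ZONE"/*; do
        [ -f "$f" ] || continue
        n=$((n + 1))
        # A distinct invocation id per run lets a multi-instance job
        # process several files concurrently -- names are placeholders:
        # dsjob -run -param SourceFile="$f" MyProject "LoadFile.run$n"
        echo "queued: $(basename "$f")"
    done
}
```

A sequence-level loop (StartLoop/EndLoop over the file list) achieves the same thing inside DataStage itself, at the cost of processing the files one after another.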

If you only need to process a few occasional files just in time, you can use the Wait For File stage and schedule the job in advance. If it's okay to process the files at longer intervals, you can simply schedule a job to run at a fixed interval (once a day, every hour, every minute) and then process all files in the folder.

You mentioned that you have to deal with many different file names and extensions. I assume they're also of different structures. Beware of trying to build jobs that can handle anything and everything.

Depending on the frequency, type, and number of files you expect to process, you have several ways to achieve the best performance: either loop over a few files in a sequence file by file and do complex work on each file, or read many files at once in a parallel job. Looping over hundreds of files in a sequence with several jobs inside the loop could end up in very long coffee breaks.

If the task is to just move the files, maybe a shell script (-> command stage) is your friend.
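For the move-only case, the kind of script a command stage could call might look like this. A minimal sketch under stated assumptions: both paths are placeholders, and the timestamp prefix is just one way to avoid name collisions when files with the same name arrive repeatedly.

```shell
#!/bin/sh
# Sketch: sweep everything out of the landing zone into an archive folder.
# SRC and DEST are assumed paths.
SRC="${SRC:-/data/landing}"
DEST="${DEST:-/data/archive}"

sweep() {
    mkdir -p "$DEST"
    for f in "$SRC"/*; do
        [ -f "$f" ] || continue
        # Prefix with a timestamp so a re-delivered file never
        # overwrites an earlier one in the archive.
        mv "$f" "$DEST/$(date +%Y%m%d%H%M%S).$(basename "$f")"
    done
}
```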

But if you have tons of files (no matter the name) with the same structure (like CSV files) and you need the content in a database, then you can read them all at once in a parallel job using the Sequential File stage and save them directly into a DataSet. That stage allows you to select files by pattern (meaning that * is your friend in this case), and it can output the filename to a new field. So you'd end up with a DataSet containing your data and the corresponding filenames.

Even if the files do not have the same structure, you can output the whole file content into one LOB column and still read them all in one job.

If you name the DataSet dynamically, you can schedule another independent job to process the queue of DataSets in parallel for further processing.

Justus Kenklies