
I am trying to optimize our batch process that pulls data and inserts it into a database. Currently, we pull data from a source database, create a text file, and load it into our reporting database. The job runs on a time schedule in Autosys, since most of the time the data is available by a certain time. However, lately the data source has been late, we are not able to pull the data during the scheduled window, and we have to run the shell script manually.

I want a shell script that runs the queries every hour and, if the data exists, spools it to a file to be loaded into the DB. If the data isn't there, it should try again the next hour, so that we can eliminate any manual intervention.

I know I can set up a file trigger in Autosys to run the load into the database if the file exists, but I am having trouble setting up the shell script so that it only pulls the data once it is available and doesn't repeat the next hour if the file has already been spooled. I am new to UNIX, so I am not sure how to proceed. Any help would be great.

codeforester
  • The important component here is the source of the data: How do you pull the data from that source? Is it a database, is it a file that is created automatically? Can you test for its presence without destroying it? Once that is known, you can replace the batch job with a loop that waits and checks for the data, processes it, and stops. – Henk Langeveld Mar 01 '17 at 22:11
  • It is a query that pulls from the source database (teradata), spools the file, and then loads into the reporting database (oracle). – user3517246 Mar 01 '17 at 23:57
  • I've never worked with Teradata, but did answer some questions about scripting around it. You would need to create a query for the Teradata db that will tell you if the data you expect is present. Is there some key value you can use as a *sentinel* of sorts? – Henk Langeveld Mar 02 '17 at 20:02
  • Just another idea that I am currently using with [ActiveEon](http://www.activeeon.com)'s scheduler is to use a condition in your workflow to resubmit the job through their REST API. This will allow you to choose when to resubmit it according to the number of times it was previously submitted. For instance, the next hour on the first resubmission, then 30 min on the second one, then 15 min, etc. – XYZ123 Mar 09 '17 at 10:08
  • There's no key value, it literally is just a data pull. – user3517246 Mar 10 '17 at 17:17

1 Answer


You haven't stated your priority clearly. The priorities could be:

  • load the data as soon as it is available
  • load the data at least once every x minutes or hours
  • eliminate any need for manual intervention (which is clear from your question)

This is what you could do, assuming there is no need to load the data as soon as it is available:

  • increase the frequency of the Autosys job (instead of hourly, maybe run it every 30 or 15 minutes)
  • change the script so that:
    • it attempts a load only if it has been at least x minutes since the last successful load; otherwise it does nothing and exits with success
    • it stores the last successful load timestamp in a file (which would be touched only upon a successful load)
    • if the data doesn't arrive even after x plus some buffer minutes, it might make more sense for the load job to fail so that it gets the required attention.
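The approach above could be sketched roughly as follows. This is a minimal sketch, not a drop-in script: `pull_and_spool` is a hypothetical placeholder for the real Teradata query/spool step, and the stamp-file path, spool path, and threshold are illustrative assumptions.

```shell
#!/bin/sh
# Timestamp-gated load: run as often as you like from Autosys; it only
# actually loads when enough time has passed since the last success.

STAMP_FILE=/tmp/last_successful_load   # touched only after a successful load
MIN_GAP_MINUTES=1380                   # ~23 hours: load at most once per day

pull_and_spool() {
    # Placeholder: run the real Teradata query here and spool the result
    # to "$1". It should return 0 only when the spool file has data.
    return 1
}

# If we loaded recently, do nothing and exit 0 so Autosys marks success.
if [ -n "$(find "$STAMP_FILE" -mmin "-$MIN_GAP_MINUTES" 2>/dev/null)" ]; then
    echo "Already loaded within the last $MIN_GAP_MINUTES minutes; nothing to do."
    exit 0
fi

if pull_and_spool /path/to/spool_file.txt; then
    touch "$STAMP_FILE"                # record the successful load
    exit 0
fi

echo "Data not available yet; will retry on the next scheduled run." >&2
exit 0   # change this to a non-zero exit once the data is overdue, to alert Autosys
```

The `find ... -mmin -N` test matches the stamp file only if it was modified within the last N minutes, which is what gates the once-per-day behavior without any extra state.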
codeforester
  • Hi, so sorry, I've only ever posted one other question. I need to spool the file only once per day, when the data is available; therefore, if the job runs at 10am and there is no data, run again at 11am. If there is data, spool the file, then do not run the query again at 12pm. I've never used Autosys before; is there a way that if a file is spooled successfully, the job is marked as success? – user3517246 Mar 01 '17 at 21:58
  • So, if your goal is to load data only once per day, you could still run your job every hour and use the last data load timestamp to control the frequency of loading. I believe Autosys obtains a job's run status from its exit code (0 is success). – codeforester Mar 02 '17 at 04:04
  • Another point, if there are any other jobs that depend upon your load job, you could treat it as a failure when it has been more than 24 hours since the last load and there is no data available for loading. I just updated the answer to reflect this. – codeforester Mar 02 '17 at 04:05
  • Thank you! This helps a lot! – user3517246 Mar 02 '17 at 17:38