
There is an ETL job that deals with over 43,000,000 rows, and it often fails with APT_BadAlloc when it processes a Join stage. Here is the log.

Join_Stage,0: terminate called after throwing an instance of 'APT_BadAlloc'
Issuing abort after 1 warnings logged.
Join_Stage,3: Caught exception from runLocally(): APT_Operator::UnControlledTermination: From: UnControlledTermination via exception...
Join_Stage,3: Caught exception from runLocally(): APT_Operator::UnControlledTermination: From: UnControlledTermination via exception...
Join_Stage,3: The runLocally() of the operator failed.
Join_Stage,3: Operator terminated abnormally: runLocally() did not return APT_StatusOk
Join_Stage,0: Internal Error: (shbuf): iomgr/iomgr.C: 2670 

My question is about the first warning. The event type is Warning and the message ID is IIS-DSEE-USBP-00002.

Join_Stage,0: terminate called after throwing an instance of 'APT_BadAlloc'

After this warning, the job fails, and it happens often, but I couldn't figure out how to fix it. The only workaround our team has for this error is to wait at least 10-15 minutes and then restart the ETL job. That usually resolves the issue, but it is not a permanent solution, so I have been searching every day without finding what my first step to resolve the error should be, or how to go about it at all.
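
For reference, our workaround scripted looks roughly like the sketch below. It is only an illustration of what we do by hand: the dsjob path, project and job names are placeholders, and with -jobstatus the exit code reflects the job's finishing status (commonly 1 = finished OK, 2 = finished with warnings), so the success check may need adjusting for your install.

    # Sketch of the "wait, then rerun" workaround using the dsjob CLI.
    # DSJOB path, PROJECT and JOB are placeholders for our environment.
    import subprocess
    import time

    DSJOB = "/opt/IBM/InformationServer/Server/DSEngine/bin/dsjob"  # assumed install path
    PROJECT, JOB = "MY_PROJECT", "MY_JOIN_JOB"                      # placeholders

    def run_with_backoff(max_attempts=3, pause_minutes=15):
        for attempt in range(1, max_attempts + 1):
            # -jobstatus makes dsjob wait for the job and return its status as the exit code
            # (an aborted job may first need a reset run: dsjob -run -mode RESET ...)
            rc = subprocess.call([DSJOB, "-run", "-jobstatus", PROJECT, JOB])
            if rc in (1, 2):   # assumed "finished" / "finished with warnings" statuses
                return rc
            print(f"attempt {attempt} ended with rc={rc}; sleeping {pause_minutes} minutes")
            time.sleep(pause_minutes * 60)
        raise RuntimeError("job still failing after retries")

    if __name__ == "__main__":
        run_with_backoff()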

I checked APT_DUMP_SCORE in the Administrator client. Currently it is set to False. By the way, if I set the option to True, where and how do I read the dump score report? Our server is a Linux server, and the ETL developers are not system admins for that server. Is there an option in the DataStage Designer (client) to see the dump score report? I read about the report on the IBM website: https://www.ibm.com/docs/en/iis/11.5?topic=flow-apt-dump-score-report but I couldn't find the location of the report. Is it provided in the job log area?

1. Log/View

2. APT_DUMP_SCORE options

I also saw some options for the buffer size on the system. All of the sizes have their default values. These are very important settings, so I haven't touched any of the options here. Please let me know how I can figure out the root cause.

I'm not a system admin, so I would have to contact someone who can look into a detailed log file to find the biggest row in the dataflow.
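
To get a feel for the scale, here is a rough back-of-envelope calculation I did in Python. The average record width, the number of buffered links and the 3 MB per-buffer figure are assumptions on my part, not measured values.

    # Rough back-of-envelope only; every constant below is an assumption, not a measurement.
    rows = 43_000_000          # row count from the failing job
    avg_record_bytes = 200     # assumed average record width
    partitions = 4             # Join_Stage,0 .. Join_Stage,3 in the log

    data_gb = rows * avg_record_bytes / 1024**3
    print(f"one full pass over the data: ~{data_gb:.1f} GB")

    # Each buffered link is assumed to keep about 3 MB in memory per partition
    # before spilling, so default buffering by itself is small compared to the data.
    buffer_default_mb = 3
    buffered_links = 2         # two inputs into the Join (assumption)
    print(f"default in-memory buffering: ~{buffer_default_mb * buffered_links * partitions} MB")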

3. System Buffer Size settings


FYI, I clicked the Resource Estimation menu against our test server, but it requires too many resources to perform the estimation, so I couldn't get an estimate through that menu.

4. Resource Estimation menu on Data Stage Designer


llearner

2 Answers


The dump score will be logged to your job log, as shown in the link you already mentioned. You need to check the log details: double-click the entry starting with something like "main_program", usually within the first five entries of the job run. This of course means the job needs to run again after you set APT_DUMP_SCORE to True.
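
If you cannot get to the Director log but do have shell access to a machine with the engine tier, the same entry can also be pulled with the dsjob command line. The sketch below is only an illustration: the install path, project and job names are placeholders, and it assumes the usual -logsum layout (event number in the first column) and that the score entry contains the phrase "This step has".

    # Illustration only: fetch the dump score entry from the job log via dsjob.
    import subprocess

    DSJOB = "/opt/IBM/InformationServer/Server/DSEngine/bin/dsjob"  # assumed install path
    PROJECT, JOB = "MY_PROJECT", "MY_JOIN_JOB"                      # placeholders

    # List the log summary, then fetch the full detail of entries mentioning main_program;
    # the dump score is normally one of the first informational entries of the run.
    summary = subprocess.run([DSJOB, "-logsum", PROJECT, JOB],
                             capture_output=True, text=True).stdout
    for line in summary.splitlines():
        if "main_program" in line:
            entry_id = line.split()[0]       # assumes the event number is the first field
            detail = subprocess.run([DSJOB, "-logdetail", PROJECT, JOB, entry_id],
                                    capture_output=True, text=True).stdout
            if "This step has" in detail:    # dump score reports usually contain this phrase
                print(detail)
                break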

Your main problem seems to be a lack of memory when the job gets executed. Add more memory, or ensure that fewer jobs are executed in parallel when this job is started and run.

MichaelTiefenbacher
  • Thank you so much for sharing your knowledge with me. How can I reduce the number of jobs when this specific job gets started? I'm using a sequence, and this job starts together with at least one other job (I have to check when I'm back at work). One simple option for me is to separate these two jobs in the sequence, so this job only starts after Job 1 finishes and reports OK. Would that be good for testing? I'll try it out. – llearner Mar 05 '23 at 20:07
  • The number of jobs can be limited within the DataStage Workload Manager, for example. – MichaelTiefenbacher Mar 06 '23 at 17:48
  • Thank you for the insights about DataStage and data engineering. I also searched for how to manage the number of jobs that get started and found the menu. Our Job Start (maximum starting jobs) is 100 jobs in 10 seconds. The Job Count (maximum running jobs) is 20, which means 20 concurrently running jobs are allowed on the system. I have to ask why our team set 100 jobs in 10 seconds in previous years. By the way, when I looked in the DataStage Director, 3 jobs were running concurrently at the time of the failure, and one of them is the job I posted about. We have 55 GB physical and 31 GB virtual memory. – llearner Mar 06 '23 at 22:43

Sometimes setting these environment variables helps: APT_DISABLE_COMBINATION=TRUE and APT_DISABLE_FASTALLOC=1.

-or- It appears that you are using 4 compute nodes. You can try using 2 players instead. This uses less memory, but will be slower.

-or- Another option would be to A) edit the job design, B) open each Join stage, and C) in the stage properties, on the Advanced tab, set the Execution mode to Sequential. Your dataset and database stages will still read and write in parallel on all nodes, but when data is sent to the Join stage it will be repartitioned down to 1 node. This strategy saves on memory usage, but the job might take more time to finish.
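
To illustrate why a join over sorted input can get by with little memory even on one node (this is a generic sketch, not DataStage internals): when both inputs arrive sorted on the join key, only the rows for the current key value have to be held at once. The tiny inputs below are made up.

    # Generic merge-join illustration: with key-sorted inputs, only one key group
    # is materialised at a time, so memory stays bounded regardless of row count.
    from itertools import groupby
    from operator import itemgetter

    def merge_join(left_rows, right_rows, key=itemgetter(0)):
        """Inner-join two key-sorted iterables, yielding (key, left_row, right_row)."""
        left_groups = groupby(left_rows, key)
        right_groups = groupby(right_rows, key)
        lk, lgrp = next(left_groups, (None, None))
        rk, rgrp = next(right_groups, (None, None))
        while lgrp is not None and rgrp is not None:
            if lk == rk:
                lbuf = list(lgrp)            # only the current key group is held in memory
                for r in rgrp:
                    for l in lbuf:
                        yield lk, l, r
                lk, lgrp = next(left_groups, (None, None))
                rk, rgrp = next(right_groups, (None, None))
            elif lk < rk:
                lk, lgrp = next(left_groups, (None, None))
            else:
                rk, rgrp = next(right_groups, (None, None))

    # Made-up, already-sorted sample inputs
    left = [(1, "a"), (1, "b"), (2, "c")]
    right = [(1, "x"), (3, "y")]
    print(list(merge_join(left, right)))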

Regards, Emily