0

We have 2 production environments say XX and XY. Both are on Oracle 11g Release 11.2.0.2.0. Both the environments have a queue table and a job queue to en-queue/de-queue jobs and process them. The application server is Apache Tomcat Version 6.0.32.

The problem I am facing is that in XX environment, the Oracle scheduler gets "hanged" or "stuck" almost daily at 4.30pm local time, whereas in XY environment it works perfectly fine. The entries for the scheduled jobs start accumulating in the queue table causing alerts to trigger off. Someone then has to re-start the application server after which the whole thing works perfectly fine and the pending jobs are processed without any issues. The DBA always says no issues in the database, but I am not really convinced.

Searching on google led me to this link : DBMS_SCHEDULER jobs stuck after upgrade to 11.2 And based on the explanation I checked and compared the DB parameter job_queue_processes. It has got the same values/details, so I am at a dead-end here.

Values for job_queue_processes

Once while re-starting the application I got the below exception on the output/display but not in the logs:

Mar 18, 2014 11:44:19 AM org.apache.catalina.startup.Catalina stopServer
SEVERE: Catalina.stop:
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:529)
at java.net.Socket.connect(Socket.java:478)
at java.net.Socket.<init>(Socket.java:375)
at java.net.Socket.<init>(Socket.java:189)
at org.apache.catalina.startup.Catalina.stopServer(Catalina.java:422)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.catalina.startup.Bootstrap.stopServer(Bootstrap.java:338)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:416)

So now I am wondering,

  • Is it the case that while jobs are getting enqueued, the connection to DB is getting lost in between, causing the Oracle scheduler to "hang" in between?
  • Is there any other database parameter I need to verify/validate? Is this is a known bug in Oracle? If yes is it documented anywhere? Sorry I may have missed out reading.
  • Is there any other information that I need to ask the DBA to check?

Thanks in advance for any help.

EDIT: Link for the output of the query => select * from DBA_SCHEDULER_JOBS is given in the excel at the link: Job_output.xlsx

Thanks!

Update as on 20/Mar/2014

The problem is not resolved as yet. Hence some more output from queries is posted below in the question.

SQL> select queue,msg_state,expiration_reason,count(*)
  2  from AQ$JOB_QTBL
  3  group by queue,msg_state,expiration_reason;

QUEUE                          MSG_STATE        EXPIRATION_REASON    COUNT(*)
------------------------------ ---------------- ------------------ ----------
JOB_Q                          READY                                        4

SQL> select owner, name, queue_type, max_retries, retry_delay
  2    from dba_queues
  3   where queue_table = 'JOB_QTBL'
  4   and owner = 'ENV_XX';

OWNER                          NAME                           QUEUE_TYPE         MAX_RETRIES RETRY_DELAY
------------------------------ ------------------------------ -------------------- ----------- -----------
ENV_XX                        AQ$_JOB_QTBL_E                 EXCEPTION_QUEUE                0           0
ENV_XX                        JOB_Q                          NORMAL_QUEUE                   5           0

SQL> 
KG3
  • 43
  • 1
  • 12
  • pls show the output for DBA_SCHEDULER_JOBS and DBA_SCHEDULER_RUNNING_CHAINS. Need more info, but a scheduler job getting "stuck" sounds like a stalled chain? just a guess – tbone Mar 18 '14 at 13:52
  • Since this is a live environment, I will have to mask some of the job names, and owner names before I post the information here. But in summary - DBA_SCHEDULER_RUNNING_CHAINS returned no rows, whereas DBA_SCHEDULER_JOBS returned 67 rows. I will mask some of the information and post the output here in some time. Also, do you want the output when the scheduler is not working, or do you want me to post the output "as is" just now? – KG3 Mar 18 '14 at 14:01
  • posting the output for dba_scheduler_jobs (for the jobs you are concerned about) would be a big help, don't have to wait for it to be stuck (although the "state" field may show clues when the job is stuck) – tbone Mar 18 '14 at 14:09
  • Hi, sorry; but how do I attach the excel/csv output of 67 rows of the query select * from DBA_SCHEDULER_JOBS; to this comment? – KG3 Mar 18 '14 at 14:25
  • you don't need to post all 67 rows, just the rows that pertain to the job(s) that you say are getting stuck. Edit your question with the result (not in comments) – tbone Mar 18 '14 at 14:39
  • Hi, I have edited the question and added the output of the query. The two jobs shown in the excel almost always stop executing. Could you please check the output and tell me what I am missing here? – KG3 Mar 19 '14 at 05:41
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/50036/discussion-between-kg3-and-tbone) – KG3 Mar 19 '14 at 13:06
  • After restarting the app server all is working fine ....: So, what exactly does the app server do to accomplish that? (assuming no db restart) –  Mar 20 '14 at 15:09
  • The application server is Apache Tomcat server, which runs the entire application. It basically picks up zip files from a pre-defined folder on the file server, unzips them, loads them in the db and processes them. After that it will send out the mail to all stake holders after the processing is complete. In this case, we can see the request to pick up the zip file getting enqueued in the queue table, but for some reason, it is not getting dequeued correctly. After re-start, the pending jobs in the queue table get dequeued correctly and execute without any problem. – KG3 Mar 21 '14 at 07:36
  • To rule out any other conflicting issue (other db backup, process etc.), what we have done is changed the time of this particular job from 10.00AM UTC to 12.00PM UTC. I will check if the problem is resolved today after this change and let you know – KG3 Mar 21 '14 at 07:36
  • Changing the time of the job did not make any difference. The same scenario was observed. We have asked the DBA to analyze further now. I will post the solution and steps taken to resolve the problem here, as and when the DBA completes his analysis. – KG3 Mar 24 '14 at 07:42
  • Still no luck. The problem persists as of today i.e. 25.03.2014. No clue from DBA as well. – KG3 Mar 25 '14 at 11:20
  • While monitoring the DB, the DBA's have observed that previously there were two sessions running ENV_XX (program = JDBC thin client). Both the sessions are waiting for queue request. After some time (1-2 hrs later), only one session is running and no issue reported. The assumption is that, issue might be occurring due to these two sessions. Once they get the request in queue table, both try to execute same job causing the scheduler to get "hanged". However this is only a possibility, not verified, as the scheduler did not get "hanged" in the last 1-2 days. – KG3 Mar 28 '14 at 12:05
  • Just to update here: We checked the code and found that ultimately the procedure executes `DBMS_AQ.ENQUEUE` procedure with all the parameters. Hence, as of now our DBA's have raised the query with Oracle Support. If the root cause is found, I will update the reason here. Thanks to all those who have helped us. – KG3 Apr 25 '14 at 06:47

0 Answers0