3

I am using DFSORT to copy a tape data set to a temp file, processing around 80,000,000 records. It's taking 3 hours just to copy the data sets. Is there any way to reduce the CPU time? Suggestions will be very helpful. Thank you.

    //STEP40  EXEC SORTD                                              
    //SORTIN   DD DSN=FILEONE(0),                           
    //            DISP=SHR                                            
    //SORTOUT  DD DSN=&&TEMP,                                       
    //            DISP=(NEW,PASS,DELETE),                          
    //            DCB=(RECFM=FB,LRECL=30050,BLKSIZE=0),               
    //            UNIT=TAPE                                           
    //SYSOUT   DD SYSOUT=*                                            
    //SYSPRINT DD SYSOUT=*                                            
    //SYSIN    DD *                                                   
         SORT FIELDS=(14,6,PD,A,8,6,PD,A,45,2,ZD,A)                   
         OUTREC IFTHEN=(WHEN=(70,18,CH,EQ,C' encoding="IBM037"'),     
                     OVERLAY=(70:C'  encoding="UTF-8"'))              
         OPTION DYNALLOC=(SYSDA,255)                                  
    /*                                                                
NITISH SINGH
  • 55
  • 1
  • 9

3 Answers

2

I love diagnosing these kinds of problems...

80M records at 30K each is about 2.5TB, and since you're reading and writing this data, you're processing a minimum of 5TB (not including I/O to the work files). If I'm doing my math right, this averages about 500MB/second over three hours.

First thing to do is understand whether DFSORT is really actively running for 3 hours, or if there are sources of wait time. For instance, if your tapes are multi-volume datasets, then there may be wait time for tape mounts. Look for this in the joblog messages - might be that 20 minutes of your 3 hours is simply waiting for the right tapes to be mounted.

You may also have a CPU usage problem adding to the wait time. Depending on how your system is set up, your job might be getting only a small slice of CPU time and waiting the rest of the time. You can tell by looking at the CPU time consumed (it's also in the joblog messages) and comparing it to the elapsed time...for instance, if your job gets 1000 CPU seconds (TCB + SRB) over 3 hours, you're averaging 9% CPU usage over that time. It may be that submitting your job in a different job class makes a difference - ask your local systems programmer.

Of course, 9% CPU time might not be a problem - your job is likely heavily I/O bound, so a lot of the wait time is about waiting for I/O to complete, not waiting for more CPU time. What you really want to know is whether your wait time is waiting for CPU access, waiting for I/O or some other reason. Again, your local systems programmer should be able to help you answer this if he knows how to read the RMF reports.

Next thing to do is understand your I/O a little better with a goal of reducing the overall number of physical I/O operations that need to be performed and/or making every I/O run a little faster.

Think of it this way: every physical I/O is going to take a minimum of maybe 2-3 milliseconds. In your worst case, if every one of those 160M records you're reading/writing were to take 3ms, the elapsed time would be 160,000,000 X .003 = 480,000 seconds, or five and a half days!

As another responder mentions, blocksize and buffering are your friends. Since most of the time in an I/O operation comes down to firing off the I/O and waiting for the response, a "big I/O" doesn't take all that much longer than a "small I/O". Generally, you want to do as few and as large physical I/O operations as possible to push elapsed time down.

Depending on the type of tape device you're using, you should be able to get up to 256K blocksizes on your tape - at an LRECL of 30050, that's 8 records per I/O (8 x 30,050 = 240,400 bytes). Your BLKSIZE=0 might already be getting you this, depending how your system is configured. Note though that this is device dependent, and watch out if your site happens to use one of the virtual tape products that map "real" tape drives to disk...here, blocksizes over a certain limit (32K) tend to run slower.
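If BLKSIZE=0 isn't already getting you a large block, it could be coded explicitly - a sketch only, assuming a drive with a 256K limit (240400 = 8 x 30050; check your actual device's maximum first):

    //SORTOUT  DD DSN=&&TEMP,
    //            DISP=(NEW,PASS,DELETE),
    //*           240400 = 8 records x 30050 bytes, under a 256K device limit
    //            DCB=(RECFM=FB,LRECL=30050,BLKSIZE=240400),
    //            UNIT=TAPE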

Buffering is unfortunately more complex than the previous answer suggested...it turns out BUFNO is for relatively simple applications using IBM's QSAM access method - and that isn't what DFSORT uses. Indeed, DFSORT is quite smart about how it does its I/O, and it dynamically creates buffers based on available memory. Still, you might try running your job in a larger region (for instance, REGION=0 in your JCL), and you might find DFSORT options like MAINSIZE=MAX help - see this link for more information.
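A sketch of that combination, reusing the step from the question (MAINSIZE=MAX is a real DFSORT option; whether it actually helps depends on your installation's defaults):

    //STEP40  EXEC SORTD,REGION=0M
    //* ...same DD statements as in the question...
    //SYSIN    DD *
         SORT FIELDS=(14,6,PD,A,8,6,PD,A,45,2,ZD,A)
         OUTREC IFTHEN=(WHEN=(70,18,CH,EQ,C' encoding="IBM037"'),
                     OVERLAY=(70:C'  encoding="UTF-8"'))
         OPTION DYNALLOC=(SYSDA,255),MAINSIZE=MAX
    /*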

As for your disk I/O (which includes those SORTWK datasets), there are lots of options here too. Your 30K LRECL limits what you can do for blocking to a good degree, but there are all sorts of disk tuning exercises you can go through, from using VIO datasets to PAVs (parallel access volumes). Point is, a lot of this is also configuration-specific, and so the right answer is going to depend on what your site has and how it's all configured.
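One concrete, low-risk tweak in this area: tell DFSORT roughly how much data is coming so it can size its work space up front. A sketch using the FILSZ option, where the E prefix marks the value as an estimated record count:

         SORT FIELDS=(14,6,PD,A,8,6,PD,A,45,2,ZD,A)
         OPTION DYNALLOC=(SYSDA,255),FILSZ=E80000000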

But maybe the most important thing is that you don't want to go at it purely trial and error until you stumble across the right answer. If you want to learn, get familiar with RMF or whatever performance management tools your site has (or find a systems programmer that's willing to work with you) and dig in. Ask yourself, what's the bottleneck - why isn't this job running faster? Then find the bottleneck, fix it and move on to the next one. These are tremendous skills to have, and once you know the basics, it stops feeling like a black art, and more like a systematic process you can follow with anything.

Valerie R
  • 1,769
  • 9
  • 29
  • More on this topic at this link https://stackoverflow.com/q/51840054/6943197 I'm with you on solving big problems like this. My gut feel is that the problem is that he's way under-blocked and I/O is inefficient. I suggested 2 tape units to improve the possibility of a quick switch to reduce time. I'm not familiar with Syncsort, but in the next post you'll see the defaults for SORTWK are woefully under-specified. Love to see your thoughts with the additional info. – Hogstrom Aug 18 '18 at 00:39
  • @Hogstrom: One complication here is that I believe Syncsort uses their own IOS driver instead of "normal" I/O, so they don't necessarily honor things like BUFNO the way you might expect. BUFNO is really a QSAM thing, although many "smart" applications see that you've coded it and use it in BSAM or EXCP processing too - really up to the app developer. With Syncsort, I think the strategy they use is about "how much memory do I have, and what's the most efficient use of it?"...sort work space (in memory), buffers, etc. Thus, you need more than JCL BUFNO - you need corresponding SORT options. – Valerie R Aug 18 '18 at 15:37
  • That's fair @valerie-r. My bigger concern is the I/O which you've pointed out. The LRECL and FB characteristics seem to me to be a significant bottleneck. I think moving to VB would be far more efficient, assuming the XML docs are truly variable length. If not, well, it is what it is. – Hogstrom Aug 18 '18 at 16:03
  • Yes, I agree with @Hogstrom...converting to VB/VBS would likely be a dramatic improvement if the actual data is on average less than the current fixed 30K LRECL. Of course, the tricky part is the conversion - we haven't seen how these records are created, and likely something in that part of the process would need to change. I think the OP said they were XML...likely the fixed records are an XML payload padded with blanks or nulls, and that would need to be changed to just the payload, with the length in the RDW and minus any padding. Still, it would be worthwhile in my opinion. – Valerie R Aug 19 '18 at 22:59
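If that conversion route is explored, DFSORT itself can do the FB-to-VB step - a sketch only (OUTFIL FTOV converts fixed-length output to variable-length, and VLTRIM=C' ' strips trailing blank padding; this assumes blanks, not nulls, as the pad character):

         SORT FIELDS=(14,6,PD,A,8,6,PD,A,45,2,ZD,A)
         OUTFIL FTOV,VLTRIM=C' '

The SORTOUT DD would then need RECFM=VB with an LRECL large enough for the 4-byte RDW (e.g. LRECL=30054).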
1

Since you write

... it takes 3 hours to complete...

I guess what you really want is to reduce elapsed time, not CPU time. Elapsed time depends on many factors such as machine configuration, machine speed, total system load, priority of your job, etc. Without more information about the environment, it is difficult to give advice.

However, I see you're writing the sort output to a temporary data set, so I conclude there is another step that reads that data back in. Why do you write this data to tape? Disk will surely be faster and reduce elapsed time.
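A sketch of the SORTOUT DD pointed at disk instead of tape - the space figures are purely illustrative, and a data set this size (roughly 2.4 TB) would need multi-volume and/or SMS striping support at your site:

    //SORTOUT  DD DSN=&&TEMP,
    //            DISP=(NEW,PASS,DELETE),
    //            DCB=(RECFM=FB,LRECL=30050,BLKSIZE=0),
    //            UNIT=(SYSDA,10),
    //*           space figures are illustrative; adjust for your volumes
    //            SPACE=(CYL,(50000,5000),RLSE)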


Peter

phunsoft
  • 2,674
  • 1
  • 11
  • 22
  • Yes, I am passing the temp data set to another step. This step alone is taking 3 hours to execute, so is there any other way, like tuning or some other utility, to reduce execution time? – NITISH SINGH Jul 28 '18 at 12:29
  • Can you post the IEF032I message for the step? It will tell us how much CPU time was used during the 3 hour execution time. – phunsoft Jul 28 '18 at 17:10
  • Quote: "This step alone is taking 3 hours to execute". This is ambiguous. Which step: the copy step or the next step? – NicC Jul 30 '18 at 08:44
  • @NicC the step I have mentioned above is taking three hours to execute. – NITISH SINGH Jul 30 '18 at 10:54
  • @phunsoft IEF234E K A220,O82247,PVT,JHW807##,STEP40 IEF234E R A230,P68559,PVT,JHW807##,STEP40 IEF233A M A220,M59594,,JHW807##,SORTD,FORCEX.PRXM.FILE1.G0049V00 – NITISH SINGH Jul 30 '18 at 11:09
  • @NITISH SINGH - Why do you post the unmount message? I was asking for the step end message IEF032I. Note that this is a multiline message. – phunsoft Jul 30 '18 at 14:03
  • @phunsoft IEF032I STEP/SORTD /STOP 2018167.2236 CPU: 0 HR 05 MIN 20.44 SEC SRB: 0 HR 00 MIN 01.98 SEC VIRT: 1032K SYS: 960K EXT: 306292K SYS: 11436K ATB- REAL: 1108K SLOTS: 0K VIRT- ALLOC: 16M SHRD: 0M – NITISH SINGH Aug 01 '18 at 10:58
  • The sort step uses only 321 seconds of CPU time during the 3 hours it runs. So, unless your machine is really busy and your job runs with a discretionary goal, I'd say there is not much to do with respect to CPU usage. However, if your data set is really 80 million records of 30050 bytes each, that is a huge data set (~2.2 TiB). You're reading this from tape, and you're writing it to tape. Your tape infrastructure may simply not be able to process this amount of data quicker. – phunsoft Aug 01 '18 at 11:22
  • How many tape mounts are there in total? How long does one mount, or unmount take? – phunsoft Aug 01 '18 at 11:23
1

A few comments on improving I/O performance which should improve your overall elapsed time.

  1. On your SORTIN and SORTOUT DD statements, add the following to your DCB.

From IBM's MVS JCL Manual on page 143.

//SORTIN   DD DSN=FILEONE(0),                           
//            DISP=SHR,DCB=BUFNO=192
//SORTOUT  DD DSN=&&TEMP,                                       
//            DISP=(NEW,PASS,DELETE),                          
//            DCB=(RECFM=FB,LRECL=30050,BLKSIZE=0,BUFNO=192),
//            UNIT=TAPE

I chose 192 as it's relatively cheap in terms of memory these days. Adjust for your environment. This tells the system how many buffers to assign to the data set, so more blocks can be read ahead, reducing time spent waiting on I/O operations. You can play with this number to get an optimal result. The default is 5.

BUFNO=buffers
Specifies the number of buffers to be assigned to the DCB. The maximum normally is 255, but can be less because of the size of the region. Note: Do not code the BUFNO subparameter with DCB subparameters BUFIN, BUFOUT, or DD parameter QNAME.

  2. You might consider the blocksizes. The blocksize on the output seems odd. Ensure that it is optimized for the device you are going to. For TAPE devices this should be as large as possible. For 3480 or 3490 devices this can be as large as 65535. With an LRECL of 30050 you could specify a BLKSIZE of 60100, which is two records per block - better I/O efficiency.

Here is more information on BLKSIZE selection for tapes.


3490 Emulation (VTS)    262144 (256 KB)
3590                    262144 (256 KB) (note: on some older models the
                                        limit is 229376 (224 KB))
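For a 3480/3490-class device with the 65535 limit, the DCB from the question could be coded explicitly - illustrative only, adjust for your actual device:

    //SORTOUT  DD DSN=&&TEMP,
    //            DISP=(NEW,PASS,DELETE),
    //*           60100 = 2 records x 30050 bytes, under the 65535 limit
    //            DCB=(RECFM=FB,LRECL=30050,BLKSIZE=60100),
    //            UNIT=TAPE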
  3. Last quick hint: if you are actually using TAPE, specify multiple TAPE devices. This allows one tape to be written while the next one is being mounted. I've included the BUFNO example here as well:

//SORTOUT  DD DSN=&&TEMP,
//            DISP=(NEW,PASS,DELETE),
//            DCB=(RECFM=FB,LRECL=30050,BLKSIZE=0,BUFNO=192),
//            UNIT=(TAPE,2)

Of course these optimizations depend on your physical environment and DFSMS setup.

Hogstrom
  • 3,581
  • 2
  • 9
  • 25
  • Thank you for this descriptive comment @Hogstrom. I tried the same with the hints you provided, but now every time I'm getting "WER046A SORT CAPACITY EXCEEDED" – NITISH SINGH Aug 14 '18 at 09:44