Find matched and unmatched records and position of key-word is unknown

Question

I have two files, FILE1 & FILE2, and let's say both have fixed-length records of 30 characters. I need to find the records from FILE1 & FILE2 which contain the string 'COBOL', where the position of this key-word is unknown and changes for every record. To be more clear, below is a sample layout.

FILE1 :

NVWGNLVKAKOIVCOBOLLKASVOIWSNVS           
SOSIVSNAVIS7780HLSVHSKSCOBOL56           
ZXCVBNMASDFGHJJKKLIIUYYTRREEWQ          
1234567890COBOL1234556FCVHJJHH           
COBOL1231231231231231341234334

FILE2 :

123456789012345678901234567890           
COBOL1231231231231231341234334          
GYKCHYYIIHHFTIUIHGJUGTUHGFUYHG

Can anyone explain how to do this using SORT or JOINKEYS, and also with a COBOL program? I need two output files.

Output FILE-OP1 (contains all the records having the COBOL key-word from FILE1 & FILE2):

NVWGNLVKAKOIVCOBOLLKASVOIWSNVS
SOSIVSNAVIS7780HLSVHSKSCOBOL56
1234567890COBOL1234556FCVHJJHH
COBOL1231231231231231341234334
COBOL1231231231231231341234334

Output FILE-OP2 (contains only the matching records with the COBOL key-word from FILE1 & FILE2):

COBOL1231231231231231341234334

Without knowing what output you want, it is not possible. JOINKEYS is part of SORT, so there is no "or". Unless you have to match the files in some way (which you don't mention), JOINKEYS is not for the task. The logic is the same, COBOL or SORT: process files, produce the output required. The implementation will be different, but we need to know what is expected for the output and where you are stuck in achieving that. – Bill Woodger Feb 27 '15 at 09:03
Thanks for the response, Bill. This question was asked in one of my interviews. Also, I don't have much exposure to JOINKEYS, so I'm very interested to know the answer and expand my knowledge. I would be very thankful if you could share any material related to JOINKEYS. – ASHOK POLURU Feb 28 '15 at 00:53

score 0 · Accepted Answer · answered Feb 27 '15 at 14:17

An example, pseudo-codeish, Cobol:

Open File1
Read File1 into The-Record
Perform until End-Of-File
  Perform varying II from 1 by 1
    until II > length of The-Record
    If The-Record (II:5) = 'COBOL'
      Display "Found COBOL at position " II
    End-If
  End-Perform
  Read File1 into The-Record
End-perform

Repeat for file2 with the same program pointed at your other file.

As this sounds homework-y, I've left several little quirks that you will need to fix in that code, but you should see where it blows up or fails and be able to resolve those reasonably easily.

If you need to do some sort of matching and dropping between the two files, that is a different animal and you need to get your rules for it. Are you trying to match the files that have "COBOL" located in the same position or something? What behavior do you expect?

answered Feb 27 '15 at 14:17 by Joe Zitzelberger

Thanks Joe. I faced this question in one of my interviews. I haven't used JOINKEYS so far, so I wish to expand my knowledge by learning the answer and the concept of JOINKEYS. If you can share any material on this, it would be very helpful to me. – ASHOK POLURU Feb 28 '15 at 00:59
Probably the easiest way to do the "JOINKEYS" approach in a Cobol program would be to: 1) Sort both files into temp files, scanning for the word and appending it to a fixed position in the record, along with the original offset. 2) Do the traditional parallel reads on the two files to drop unmatched records and join the matched records in a second pass. It is inefficient, but would be what SORT/JOINKEYS would do anyway. – Joe Zitzelberger Feb 28 '15 at 02:34
Why do you feel it inefficient? What would be a more efficient way to do it? – Bill Woodger Mar 02 '15 at 09:21
There is no nice, efficient, easy way to join on a variable positioned key -- it pretty much does require a two pass process or make each record into a file-spaced scan. – Joe Zitzelberger Mar 02 '15 at 16:36
@JoeZitzelberger The records are to match in their entirety. It is that only those containing COBOL that should be matched. So simple, really. Order for output adds processing, but I doubt OP is really aware of the order needed. – Bill Woodger Mar 07 '15 at 07:46

score 0 · Answer 2 · edited May 23 '17 at 11:56

For your FILE1, SORT it on the entire input data, only including records which contain COBOL and appending a sequence number (you show your output in the original sequence). If there can be duplicate records, SORT on the sequence-number you attach as well.

Similar for FILE2.

The SORT for each file can be stand-alone (DFSORT or SyncSort) or within a COBOL program.

You then "match" the files; here's a useful bit of pseudo-code from Bruce Martin: https://stackoverflow.com/a/22950005/1927206

Logically after the match, you then need to SORT both outputs on the sequence-number alone, and after that remove the sequence-numbers.
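The select, sort, match and re-sequence steps above can be sketched as follows. This is only an illustration of the logic, written in Python rather than COBOL or SORT control statements. The function names (`select_and_sort`, `match`, `restore_order`) are hypothetical, and it assumes that a "match" means the whole record is equal in both files, that duplicates pair off one-to-one, and that the output order is FILE1's original order followed by FILE2's.

```python
KEYWORD = "COBOL"

def select_and_sort(records, file_no):
    """Keep records containing KEYWORD, tag each with (file_no, seq), sort on the data."""
    tagged = [(rec, (file_no, seq))
              for seq, rec in enumerate(records, start=1)
              if KEYWORD in rec]
    return sorted(tagged)  # record text first; the tag breaks ties between duplicates

def match(sorted1, sorted2):
    """Two-file match on the whole record. Returns (all_tagged, matched_tagged)."""
    all_recs, matched = [], []
    i = j = 0
    while i < len(sorted1) and j < len(sorted2):
        key1, key2 = sorted1[i][0], sorted2[j][0]
        if key1 < key2:            # only on file 1
            all_recs.append(sorted1[i])
            i += 1
        elif key1 > key2:          # only on file 2
            all_recs.append(sorted2[j])
            j += 1
        else:                      # on both files
            matched.append(sorted1[i])
            all_recs.extend([sorted1[i], sorted2[j]])
            i += 1
            j += 1
    all_recs.extend(sorted1[i:])   # whatever is left on either file is unmatched
    all_recs.extend(sorted2[j:])
    return all_recs, matched

def restore_order(tagged):
    """Sort on the sequence tag alone, then remove the tag."""
    return [rec for rec, _tag in sorted(tagged, key=lambda t: t[1])]

FILE1 = ["NVWGNLVKAKOIVCOBOLLKASVOIWSNVS", "SOSIVSNAVIS7780HLSVHSKSCOBOL56",
         "ZXCVBNMASDFGHJJKKLIIUYYTRREEWQ", "1234567890COBOL1234556FCVHJJHH",
         "COBOL1231231231231231341234334"]
FILE2 = ["123456789012345678901234567890", "COBOL1231231231231231341234334",
         "GYKCHYYIIHHFTIUIHGJUGTUHGFUYHG"]

all_tagged, matched_tagged = match(select_and_sort(FILE1, 1),
                                   select_and_sort(FILE2, 2))
op1 = restore_order(all_tagged)      # every record containing COBOL
op2 = restore_order(matched_tagged)  # only records present on both files
print(op1)
print(op2)
```

Tagging with the file number as well as the sequence number is a choice made for this sketch, so that one final sort can put FILE1's records ahead of FILE2's.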

Remember that you only need to know whether COBOL is present in the data, not where it is or how many times it may be there. If using COBOL for the first two SORTs, you have a variety of ways to locate the word COBOL: as Joe Zitzelberger showed, you can use a one-byte reference-modification, but be careful not to go beyond the data with your PERFORM VARYING (use compiler option SSRANGE if you are unclear what I mean); you can use INSPECT; UNSTRING; STRING; define your data with an OCCURS, for a length of five, and use an index for a one-byte table; use OCCURS DEPENDING ON; do it "byte at a time"; etc.
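The caution about not going beyond the data comes down to where the scan stops. For a 30-byte record and a five-byte word, the last valid starting position is 30 - 5 + 1 = 26 in COBOL's 1-based terms; starting at 27 would refer to byte 31. A minimal sketch of that bound, in Python purely for illustration (the function name `contains_keyword` is hypothetical, and Python positions are 0-based):

```python
def contains_keyword(record, keyword="COBOL"):
    """Slide a window of len(keyword) across the record, stopping at the
    last start position where the whole window still fits in the data."""
    last_start = len(record) - len(keyword)  # 0-based; 25 for a 30-byte record
    for start in range(last_start + 1):
        if record[start:start + len(keyword)] == keyword:
            return True   # presence is all that matters, so stop at the first hit
    return False

print(contains_keyword("SOSIVSNAVIS7780HLSVHSKSCOBOL56"))
print(contains_keyword("ZXCVBNMASDFGHJJKKLIIUYYTRREEWQ"))
```

Python slicing quietly truncates at the end of a string, which COBOL reference-modification does not, so the bound is written out explicitly here to mirror what the PERFORM VARYING limit has to be.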

score 0 · Answer 3 · answered Mar 07 '15 at 03:29

This is a little bit like free-format number handling.

You can use "SS" (substring comparison) in DFSORT to find records containing COBOL.

Step 1. Read both input files and produce one output file, OP-1:

INCLUDE COND=(1,30,SS,EQ,C'COBOL')

Step 2. Produce a work file in the same way as Step 1, using only FILE1.

Step 3. Produce a work file in the same way as Step 1, using only FILE2.

Run JOINKEYS on these two work files to find the matches ==> outfile OP-2

Essentially this strategy serves to eliminate non qualifying rows from the join.
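The data flow of these steps can be sketched as below. This is a Python illustration of the logic only, not DFSORT syntax: `include_ss` is a hypothetical stand-in for the INCLUDE with SS shown above, and `join_on_record` stands in for a join keyed on the full 30 bytes, which is an assumption about how the match is meant to work.

```python
def include_ss(records, keyword="COBOL", start=1, length=30):
    """Keep records whose field (1-based start, given length) contains keyword."""
    return [r for r in records if keyword in r[start - 1:start - 1 + length]]

def join_on_record(work1, work2):
    """Inner join on the whole record; a key duplicated on both sides
    yields every pairing, so duplicates multiply."""
    return [r1 for r1 in work1 for r2 in work2 if r1 == r2]

FILE1 = ["NVWGNLVKAKOIVCOBOLLKASVOIWSNVS", "SOSIVSNAVIS7780HLSVHSKSCOBOL56",
         "ZXCVBNMASDFGHJJKKLIIUYYTRREEWQ", "1234567890COBOL1234556FCVHJJHH",
         "COBOL1231231231231231341234334"]
FILE2 = ["123456789012345678901234567890", "COBOL1231231231231231341234334",
         "GYKCHYYIIHHFTIUIHGJUGTUHGFUYHG"]

op1 = include_ss(FILE1 + FILE2)            # Step 1: both inputs concatenated
work1 = include_ss(FILE1)                  # Step 2
work2 = include_ss(FILE2)                  # Step 3
op2 = join_on_record(work1, work2)         # join of the two work files
print(op1)
print(op2)
```

The nested loop is quadratic and is only meant to show which records pair up; a real join would work on sorted inputs.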

answered Mar 07 '15 at 03:29 by mckenzm

What's the point of Step 1? Why do you call records "rows"? That's going to confuse a learner. It's a simple JOINKEYS with the INCLUDE COND= that you've shown in the JNFnCNTL datasets, and a SORT in the main task. OP's not interested, it would seem, anyway. – Bill Woodger Mar 07 '15 at 07:48
Step 1 produces only the first file required from a concatenation of the given files. It is never a good idea to take it upon yourself to combine separate requirements. One of them may later change and it makes traceability harder. I will pay "records" over "rows" though. -1 on me for failing to include the mandatory COBOL code content, but SORT was optional in this case. (omit SORTED). – mckenzm Mar 08 '15 at 19:53
But you can get both outputs from the JOINKEYS. Why would you need a separate step? – Bill Woodger Mar 08 '15 at 20:03
Agreed- we don't. What I am driving at is that the 2 files are separate, and whilst they can be produced in the same operation, and a competent resource would be able to maintain that code, it is better to assume that there are independent business reasons driving the requirement for each file. In the extreme this may give one BA or stakeholder the power of veto over a change to a single job step requested by another. In an academic setting, you may not get full marks, but you might get some - the intermediate files are the "show your working". – mckenzm Mar 08 '15 at 20:25
Nah. They are part of the same requirement. Even if a disjoinder arises later, simply copy and change. You don't want the client paying day-in-day-out for extra processing for some "it may happen in the future, or some possible future", do you? – Bill Woodger Mar 08 '15 at 20:33
Fair enough, maybe it is just the way I would suggest it to the OP then. Refinement is part and parcel of development. A small number of differing output files is OK. For larger numbers I like them to be similar. Makes a big difference to the cost of regression testing if the impact is smaller. I also like to consider the restart costs ;) . But all in the one step is fine then. – mckenzm Mar 09 '15 at 00:24
If you have lots of unconnected ICETOOL operators in a step, just to get them in one step, consider restart implications. If you do it in four steps when it should logically be one (plus presumably you have a fifth step to get the data into the required order) then the client overpays and you're making regression testing *more* complex. Anyway, not much point in continuing here. Can you update your answer with your points so that others do not get confused, and then we can tidy away (delete) all these comments. Thanks. – Bill Woodger Mar 09 '15 at 07:37
