0

My office uses an in house tool that can generate a COBOL formatting program, COBOL File to DB2 load program and other associated objects all from a DDL. Usually, the files we load are delimited with a '|' but a new data source only sends files delimited with a comma.

The problem I am having is that some of the text fields have commas within them and the first thing the formatting program does after reading a record in the input data is run an inspect/tally to check the correct amount of delimiters are in the record. If too many delimiters are counted the record is discarded. When a comma is in the text field, it fails a record because it can't tell the difference between a comma in a field and a delimiter.

Fortunately all text fields are surrounded with quotes " so I was going to write some code which would check each character of a record one at a time keeping a tally of quotes and if it comes across a comma and the quote count is an odd number it will ignore the comma and not count it in the tally.

Any suggestions on how to do this?

Bill Woodger
  • 12,968
  • 4
  • 38
  • 47
mmackenzie93
  • 53
  • 1
  • 9
  • Can we assume z/OS, or is it LUW DB2? – Bill Woodger Apr 22 '15 at 12:14
  • Oh yes sorry i forgot about that it is z/OS – mmackenzie93 Apr 22 '15 at 12:22
  • You *can* do it in COBOL, in about the way you outlined, but you may want to consider using SORT to do the accept/reject of the file. You could count the commas, count the quotes, and commas minus (quotes divided by two) is the number of fields. If you want to consider that, you can ask another question with either the DFSORT or SyncSORT tag (depending on which you have) and also include reference to the version of the product you have. – Bill Woodger Apr 22 '15 at 12:49
  • Don't forget to consider embedded " in a text field. – Charles Apr 22 '15 at 12:52
  • thanks for the suggestion but only the character fields have quotes around them, there are integer and data fields which have no quotes, just a comma either side of them to delimit them. – mmackenzie93 Apr 22 '15 at 12:53
  • What @Charles means is that, in theory, you may get `..,"some ""text"" here",...`. You should arrange that test-data is supplied to you from the source system with some fields containing embedded quotes (you sometimes get house-names bounded by quotes, for instance) to ensure you know what it looks like if/when it occurs. – Bill Woodger Apr 22 '15 at 13:16
  • ah sorry that was a reply to you i hadn't seen the message from Charles, I see what you mean now though and will definitely look out for it/include it in my tests – mmackenzie93 Apr 22 '15 at 14:15
  • @BillWoodger had one example...another possibility is text with the " as a symbol for inches.. "12" Ruler" – Charles Apr 22 '15 at 16:28

2 Answers2

2

This assumes that you know how much data there is on the line (maximum if variable-length) for length-of-data-on-the-line, and that you will replace OCCURS 1000 by the maximum length of the line.

The idea is to use an on-off switch. First thing in the EVALUATE is to check for the quote. If found, flip the switch. Next is to say if the switch is on, ignore this byte. Next is, if it is a comma, count it.

Once the PERFORM is finished, the-count will contain the total number of non-quote-bounded commas.

The data-names I've chosen are for illustration of the technique. You change those to be relevant to your task.

01  length-of-data-on-the-line 
                              COMP PIC 9(4).

01  the-line.
    05  FILLER OCCURS 1000 TIMES.
        10  character-on-the-line  PIC X.
            88  cotl-is-comma      VALUE COMMA.
            88  cotl-is-quote      VALUE QUOTE.

01  FILLER.
    05  FILLER                     PIC X.
        88  on-off-switch-on       VALUE "1".
        88  on-off-switch-off      VALUE "7".

01  the-count                 COMP PIC 9(4).
01  data-on-line-sub          COMP PIC 9(4).


MOVE ZERO                   TO the-count
                               data-on-line-sub
SET on-off-switch-off       TO TRUE
PERFORM 
  length-of-data-on-the-line TIMES
    ADD 1                   TO data-on-line-sub
    EVALUATE TRUE
      WHEN cotl-is-quote ( data-on-line-sub )
        IF on-off-switch-off
            SET on-off-switch-on
                             TO TRUE
        ELSE
            SET on-off-switch-off
                             TO TRUE
        END-IF
      WHEN on-off-switch-on
        CONTINUE
      WHEN cotl-is-comma ( data-on-line-sub )
        ADD 1                TO the-count
    END-EVALUATE
END-PEFORM
Bill Woodger
  • 12,968
  • 4
  • 38
  • 47
  • Ok cheers for this, makes sense and looks like it will work. I got given a piece of selcopy which is doing the job. Once it's implemented and running I'll have a go at adding this to the code to remove the extra step of running selcopy. Better to keep it all in one program I think. – mmackenzie93 Apr 22 '15 at 14:13
  • @mmackenzie93 be sure to compile with the QUOTE option and not the APOST option. – cschneid Apr 22 '15 at 22:22
  • @mmackenzie93 cschneid's comment is down to my use of the figurative constant QUOTE, which with the compiler option QUOTE will use `"` and with APOST will use `'`. If people are prone to mess about with that compiler option, you have two other ways to do it. `VALUE '"'`, or by using a hexadecimal VALUE. I'll probably update my answer with this tomorrow. – Bill Woodger Apr 22 '15 at 22:46
0

Though more than requested, this program should handle most CSV records. It was not tested with tab-delimiters. The program converts the CSV text (with removal of added quotes) from delimited by a selected delimiter to delimited by LOW-VALUES. Making it easier to separate the fields with UNSTRING ... DELIMITED LOW-VALUES INTO ....

   IDENTIFICATION DIVISION.
   PROGRAM-ID. CSV2STR.
   DATA DIVISION.
   WORKING-STORAGE SECTION.
   01  I COMP PIC 9(4).
   01  J COMP PIC 9(4).
   01  FLD-START COMP PIC 9(4).
   01  STATE COMP PIC 9(4).
   01  FLD-SEP PIC X VALUE LOW-VALUES.
   01  QUOT PIC X VALUE """".
   01  APOS PIC X VALUE "'".
   01  COMM PIC X VALUE ",".
   LINKAGE SECTION.
   01  INPUT-REC PIC X(2000).
   01  INPUT-LENGTH COMP PIC 9(4).
   01  OUTPUT-REC PIC X(2000).
   01  OUTPUT-LENGTH COMP PIC 9(4).
   01  DELIM PIC X.
   PROCEDURE DIVISION USING INPUT-REC INPUT-LENGTH
       OUTPUT-REC OUTPUT-LENGTH DELIM.
   BEGIN.
       IF INPUT-LENGTH = 0 OR > 2000
           MOVE 0 TO OUTPUT-LENGTH
           EXIT PROGRAM
       END-IF
       IF DELIM NOT = SPACE
           MOVE DELIM TO COMM
       ELSE
           MOVE "," TO COMM
       END-IF
       PERFORM CONVERT-RECORD
       SUBTRACT 1 FROM J GIVING OUTPUT-LENGTH
       EXIT PROGRAM
       .

   CONVERT-RECORD.
       MOVE 1 TO STATE I J FLD-START
       PERFORM CONVERT-RECORD-PROC
           UNTIL I > INPUT-LENGTH
       MOVE FLD-SEP TO OUTPUT-REC (J:1)
  *> FOR NO FIELD AFTER THE LAST DELIMITER
       IF INPUT-REC (I - 1:1) = COMM
           ADD 1 TO J
           MOVE FLD-SEP TO OUTPUT-REC (J:1)
       END-IF
       .

   CONVERT-RECORD-PROC.
  *> CSV-DT
       EVALUATE STATE
           ALSO I = FLD-START
           ALSO INPUT-REC (I:1)
           ALSO INPUT-REC (I + 1:1)
  *> RULE   1 DETERMINES IF FIELD BEGINS WITH QUOTE
       WHEN 1 ALSO TRUE ALSO QUOT ALSO ANY
           MOVE 2 TO STATE
           ADD 1 TO I
  *> RULE   2 SPECIAL CASE OF SPACE + APOSTROPHE AT FIELD START
       WHEN 1 ALSO TRUE ALSO SPACE ALSO APOS
           ADD 1 TO I
  *> RULE   3 COPIES ONE CHARACTER
       WHEN 1 ALSO ANY ALSO NOT COMM ALSO ANY
           MOVE INPUT-REC (I:1) TO OUTPUT-REC (J:1)
           ADD 1 TO I
           ADD 1 TO J
  *> RULE   4 ENDS A FIELD
       WHEN 1 ALSO ANY ALSO COMM ALSO ANY
           MOVE FLD-SEP TO OUTPUT-REC (J:1)
           ADD 1 TO I
           ADD 1 TO J
           MOVE I TO FLD-START
  *> RULE   5 FOR QUOTED FIELD DROPS INITAL QUOTE
       WHEN 2 ALSO ANY ALSO NOT QUOT ALSO ANY
           MOVE INPUT-REC (I:1) TO OUTPUT-REC (J:1)
           ADD 1 TO I
           ADD 1 TO J
  *> RULE   6 FOR QUOTED FIELD CONVERTS TWO QUOTED TO ONE
       WHEN 2 ALSO ANY ALSO QUOT ALSO QUOT
           MOVE QUOTE TO OUTPUT-REC (J:1)
           ADD 2 TO I
           ADD 1 TO J
  *> RULE   7 FOR QUOTED FIELD DROPS QUOTE BEFORE DELIMITER
       WHEN 2 ALSO ANY ALSO QUOT ALSO COMM
           MOVE FLD-SEP TO OUTPUT-REC (J:1)
           ADD 2 TO I
           ADD 1 TO J
           MOVE I TO FLD-START
           MOVE 1 TO STATE
  *> RULE   8 FOR QUOTED FIELD DROPS FINAL QUOTE OF LAST FIELD
       WHEN 2 ALSO ANY ALSO QUOT ALSO SPACE
           ADD 2 TO I
           ADD 1 TO J
           MOVE I TO FLD-START
           MOVE 1 TO STATE
       END-EVALUATE
       .

The following CSV file was tested:

120,ABC,123,"12"" RULER","""ABC"", ""DEF"", ""GHI""", 'ABC',"123,456"
"""789""",,,,"""mno""",,

Test program output:

1: 120
2: ABC
3: 123
4: 12" RULER
5: "ABC", "DEF", "GHI"
6: 'ABC'
7: 123,456
1: "789"
2:
3:
4:
5: "mno"
6:
7:
Rick Smith
  • 3,962
  • 6
  • 13
  • 24