I have to parse a text like this:

 000001        01 ROUTINE_NAME.                                                     
 000002       *                                                                 
 000003           05 ROUTINE_NAME-INPUT.                                            
 000004       *----------------------------------------------------------------*
 000005       *  FIELD DESCRIPTION                            OBBLIGATORIO     *
 000006       *----------------------------------------------------------------*
 000007              10 ROUTINE_NAME-FIELD_NAME                 PIC X(005).       

What is the best way to parse such things? Is there an existing library that does that?

Vitaly Olegovitch
  • Do as a human would: parse the indentation of spaces up to the first non-space. Cobol is positional, and Java string handling should suffice. Maybe regex Pattern/Matcher. Cobol is not that difficult. A language parsing library is probably overkill. – Joop Eggen Nov 11 '15 at 11:21
  • Parse it to accomplish what? If you want to correctly map these field declarations to correct offsets, lengths, and types, you have quite a job in front of you. – user207421 Nov 11 '15 at 11:40
  • Asking for a library is Off Topic. Asking how to do it is Too Broad. Has to be a specific programming problem. Yes it is hard, and EJP knows much more about it than @JoopEggen. COBOL is not positional. I'd guess the paste comes from the ISPF Editor? Meaning Mainframe? Meaning if you want to make your job easy, use the output listing from a compile. You get the compiler to do all the work, then you have a very simple (positional) interpretation of the output listing, the only remotely tricky thing being the number of OCCURS. – Bill Woodger Nov 11 '15 at 11:45
  • @Eggen: COBOL is a lot harder to parse than you think, if you really handle what is allowed in modern COBOL dialects for COBOL or MF: escapes, debug lines, expanding copylibs, fixed/varying line sizes. In the case of data declarations, the nesting levels act like indications of containers (except for several special values) but most parsers cannot pick up that nesting easily. Now you have to pick up the type and any possible multiline initializers. *If* you get past literal parsing, then you have to interpret the types to compute offsets. And we haven't discussed name lookup. – Ira Baxter Nov 11 '15 at 13:19
  • @Eggen: the standard parsing story applies here: if you want to do a bad job, you can use regex. If you want to do a good job, you need a strong parsing engine. – Ira Baxter Nov 11 '15 at 13:20
  • @Vitalij: You might reconsider submitting question to http://softwarerecs.stackexchange.com/questions/ask, which accepts such questions as useful. – Ira Baxter Nov 11 '15 at 13:22
  • @Baxter: do you have an example of a hard to parse copybook? – Dave Griffiths Oct 03 '16 at 13:43
  • @DaveGriffiths usually copybooks with nested data structures are harder to parse. – Vitaly Olegovitch Oct 04 '16 at 11:03
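
The line-by-line regex approach suggested in the first comment can be sketched roughly as follows. This is only a sketch for the exact layout pasted in the question (six-digit sequence number, level, name, optional PIC clause, terminating period). As the later comments warn, it does not handle continuation lines, OCCURS, REDEFINES, USAGE clauses, or offset calculation. The class name `CopybookLineScanner` and the method `scan` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: recognizes only the simple layout shown in the question.
class CopybookLineScanner {
    // sequence number, level number, data name, optional PIC clause, terminating period
    private static final Pattern ENTRY = Pattern.compile(
            "^\\s*\\d{6}\\s+(\\d{2})\\s+([A-Za-z0-9_-]+)(?:\\s+PIC\\s+(\\S+?))?\\.\\s*$");

    /** Returns one "level name [picture]" string per recognized data entry. */
    public static List<String> scan(List<String> lines) {
        List<String> entries = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ENTRY.matcher(line);
            if (!m.matches()) {
                continue; // comment lines (with '*') and anything unrecognized are skipped
            }
            String picture = m.group(3); // null when the entry is a group item
            entries.add(m.group(1) + " " + m.group(2) + (picture == null ? "" : " " + picture));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                " 000001        01 ROUTINE_NAME.",
                " 000002       *",
                " 000003           05 ROUTINE_NAME-INPUT.",
                " 000007              10 ROUTINE_NAME-FIELD_NAME                 PIC X(005).");
        for (String entry : scan(lines)) {
            System.out.println(entry);
        }
    }
}
```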

2 Answers

3

There are a number of mapping tools that will convert a Cobol copybook into something more easily read in Java, such as XML. If you only need a single copybook, then by far the easiest way is to map it by hand and read the record as a byte array.
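
As a rough illustration of the by-hand approach, the sketch below pulls one text field out of a fixed-length record held in a byte array. The class `FixedRecordReader`, the method `readText`, and the sample data are hypothetical. The offset and length are assumed to have been worked out manually from the copybook (here, the single `PIC X(005)` field at offset 0). Mainframe data is usually EBCDIC rather than ASCII, so the charset would have to be chosen to match the real file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for reading hand-mapped fields from a fixed-length record.
class FixedRecordReader {
    /** Decodes a PIC X(n) field; offset is 0-based and length is in bytes. */
    public static String readText(byte[] record, int offset, int length, Charset charset) {
        return new String(record, offset, length, charset);
    }

    public static void main(String[] args) {
        // Made-up 5-byte record standing in for ROUTINE_NAME-FIELD_NAME PIC X(005).
        // ASCII is used only to keep the sketch self-contained.
        byte[] record = "HELLO".getBytes(StandardCharsets.US_ASCII);
        System.out.println(readText(record, 0, 5, StandardCharsets.US_ASCII));
    }
}
```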

If what you are trying to create is something that can take any copybook and let you read/write that structure in Java, then something like IBM's DFDL or a similar tool is called for.

If you want to convert files described by that copybook, then an ETL tool like Syncsort or DataStage might be a good idea.

A recursive descent parser for the Cobol picture clause is pretty easy to write, but it might be overkill if you only need it for a one-off task.
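
To give a feel for the size of the task, here is a minimal sketch that computes the character length of a simple picture string. It is a single left-to-right scan rather than a full recursive descent parser, and it assumes USAGE DISPLAY with only the symbols `X`, `A`, `9`, `S` and `V`. Editing symbols, `P`, and SIGN SEPARATE are deliberately not handled. The names `PictureLength` and `displayLength` are hypothetical.

```java
// Hypothetical helper: length of a simple PICTURE string, assuming USAGE DISPLAY.
class PictureLength {
    /** Returns the number of character positions described by the picture string. */
    public static int displayLength(String picture) {
        String pic = picture.trim().toUpperCase();
        int length = 0;
        int i = 0;
        while (i < pic.length()) {
            char symbol = pic.charAt(i++);
            int repeat = 1;
            // An optional "(n)" after a symbol repeats that symbol n times.
            if (i < pic.length() && pic.charAt(i) == '(') {
                int close = pic.indexOf(')', i);
                if (close < 0) {
                    throw new IllegalArgumentException("Unclosed '(' in " + picture);
                }
                repeat = Integer.parseInt(pic.substring(i + 1, close));
                i = close + 1;
            }
            switch (symbol) {
                case 'X':
                case 'A':
                case '9':
                    length += repeat; // each occupies one character position
                    break;
                case 'S':
                case 'V':
                    break; // sign and implied decimal point occupy no position here
                default:
                    throw new IllegalArgumentException(
                            "Unsupported symbol '" + symbol + "' in " + picture);
            }
        }
        return length;
    }

    public static void main(String[] args) {
        System.out.println(displayLength("X(005)"));   // prints 5
        System.out.println(displayLength("S9(3)V99")); // prints 5
    }
}
```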

Really, to give any kind of answer, more detail about what you are trying to accomplish is needed.

Joe Zitzelberger
  • A recursive descent parser for the PICTURE clause? Come off it. That's a scanning job. No recursion, and therefore no parsing, necessary. – user207421 Nov 11 '15 at 21:46
  • Seriously, recursive descent is the easiest style of parser to implement without other tools like lex/yacc or bison/flex. Totally easy to turn a grammar into code using a one-for-one translation. – Joe Zitzelberger Nov 12 '15 at 07:21
  • @Joe: I've been building parsers for 40 years. Recursive descent for simple stuff (add/multiply expressions) is OK. A good parser generator stomps these for productivity especially for complex languages (e.g., COBOL). – Ira Baxter Nov 12 '15 at 11:16
  • @Ira - I'm not shooting for performance, just ease of implementation. And I'd certainly not recommend it for the nightmare that is Cobol verbs, but the data description is pretty easy on that front. – Joe Zitzelberger Nov 12 '15 at 14:21
2

If you want to parse just a copybook, have a look at the Java project cb2xml. It parses Cobol copybooks and calculates the position and length of each field. The package actually converts the Cobol copybook into XML, which can then be parsed in many languages.

If you use cb2xml.jar and cb2xml_jaxb.jar from the cb2xml project, you can parse the Cobol copybook in Java with:

        Copybook copybook = CobolParser.newParser()
                                .parseCobol(copybookName);

To print the contents in Java:

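    // Requires java.util.List plus the Copybook and Item classes from cb2xml_jaxb.
    // The wrapper method name printCopybook is illustrative only.
    public static void printCopybook(Copybook copybook) {
        List<Item> items = copybook.getItem();
        for (Item item : items) {
            printItem("   ", item);
        }
    }

    // Prints level, name, position, storage length and picture, then recurses into children.
    public static void printItem(String indent, Item item) {
        System.out.println(indent + item.getLevel() + " " + item.getName() + "\t" + item.getPosition()
                + "\t " + item.getStorageLength() + "\t" + item.getPicture());

        List<Item> items = item.getItem();
        for (Item child : items) {
            printItem(indent + "   ", child);
        }
    }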

If you use cb2xml to convert Cobol to XML, this copybook:

000001    01 ROUTINE-NAME.                                              
000003       05 ROUTINE-NAME-INPUT.                                     
000007          10 ROUTINE-NAME-FIELD-NAME                 PIC X(005).  

gets converted to:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<copybook filename="xx.cbl">
    <item         display-length="5" level="01" name="ROUTINE-NAME"                                 position="1" storage-length="5">
        <item     display-length="5"   level="05" name="ROUTINE-NAME-INPUT"                         position="1" storage-length="5">
            <item display-length="5"     level="10" name="ROUTINE-NAME-FIELD-NAME" picture="X(005)" position="1" storage-length="5"/>
        </item>    
    </item>
</copybook>
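
Since the result is plain XML, it can also be read without the cb2xml classes. The sketch below uses only the JDK's DOM parser and the attribute names visible in the output above (`name`, `picture`, `position`, `storage-length`). The class `Cb2xmlOutputReader` and the method `elementaryFields` are hypothetical names, and the embedded XML is a trimmed copy of the sample.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for XML shaped like the cb2xml output shown above.
class Cb2xmlOutputReader {
    /** Returns "name position storage-length" for every item that carries a picture attribute. */
    public static List<String> elementaryFields(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList items = doc.getElementsByTagName("item"); // document order, all nesting levels
            List<String> fields = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                if (item.hasAttribute("picture")) { // group items have no picture
                    fields.add(item.getAttribute("name") + " "
                            + item.getAttribute("position") + " "
                            + item.getAttribute("storage-length"));
                }
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read copybook XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<copybook filename=\"xx.cbl\">"
                + "<item level=\"01\" name=\"ROUTINE-NAME\" position=\"1\" storage-length=\"5\">"
                + "<item level=\"05\" name=\"ROUTINE-NAME-INPUT\" position=\"1\" storage-length=\"5\">"
                + "<item level=\"10\" name=\"ROUTINE-NAME-FIELD-NAME\" picture=\"X(005)\""
                + " position=\"1\" storage-length=\"5\"/>"
                + "</item></item></copybook>";
        for (String field : elementaryFields(xml)) {
            System.out.println(field);
        }
    }
}
```

Note that `position` in the sample output starts at 1, so it would need to be reduced by one before being used as a Java array offset.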

Disclosure: I was one of the contributors to cb2xml.


There are other projects around for parsing Cobol, e.g. legstar and the Koopa Cobol Parser.

Bruce Martin