
I am trying to convert COMP-3 and EBCDIC characters in my Java code, but I'm running into an out-of-memory exception because the amount of data handled is huge, about 5 GB. My code is currently as follows:

byte[] data = Files.readAllBytes(path);

This is resulting in an out-of-memory exception, which I can understand, but I can't use a file scanner either, since the data in the file won't be split into lines.

Can anyone point me in the right direction on how to handle this?

Note: the file may contain records of different lengths, hence splitting it based on record length seems not possible.

BaN3
  • You handle it one record at a time. There is never any need to load an entire file into memory. Compilers don't do it: why should you? – user207421 Oct 15 '15 at 08:10
  • I agree with you, I don't want to load the entire file at once, but the record lengths vary: say, the first 10 records are 140 chars, records 20-30 are 40 chars, and records 40-45 are 140 chars. These records are identified by a record_id present in the record. I'm sceptical about fetching based on a chunk size – BaN3 Oct 15 '15 at 08:23
  • You don't. You fetch it based on a record size. Somehow the original programs that read this file read it one record at a time. You can too. There is either a length word or a fixed delimiter in these records. Tell us. – user207421 Oct 15 '15 at 08:58
  • There is no delimiter; it's just continuous data. I'll have to split it based on an attribute record_id at a particular position (say at the beginning of each record) that will tell me the record length. – BaN3 Oct 15 '15 at 09:03
  • Somehow these records must be self-defining. If you can separate them by reading the entire file into memory, which is what you're essentially claiming here, you can separate them one at a time. Or else the task is impossible either way. You need to provide some information about the format. If the record starts with a length word, there's nothing stopping you from reading the length word and then the rest of the record. – user207421 Oct 15 '15 at 09:05
  • Presumably the data is coming from a Mainframe, as that covers most of the EBCDIC world. Firstly, they should never have given you non-character data (packed-decimal or binary fields). Perhaps they are arguing against that because of the size of the data. If all the fields were character, they could use the Mainframe SORT product to convert the data to (your brand of) ASCII, and then transfer the file to you as binary, allowing you to access the first two bytes of the record, which probably contain the record length. Show, in hex, a sample of your data. – Bill Woodger Oct 15 '15 at 10:11
  • It would be possible for you to do this, once you know the length of the record, but it is a silly, error-prone and suspect way to do it. It is going to take you a lot longer to convert than it would take SORT. However, there may be a charging issue there. Have a look at other questions tagged comp-3 and ebcdic. – Bill Woodger Oct 15 '15 at 10:13

2 Answers


As Bill said, you could (should) ask for the data to be converted to display characters on the mainframe, and if it is English-language data you can do an ASCII transfer.

Also, how are you deciding where the COMP-3 fields start?


You do not have to read the whole file into memory; you can still read the file in blocks. This method will fill an array of bytes:

protected final int readBuffer(InputStream in, final byte[] buf)
throws IOException {

    int total = 0;

    // Keep reading until the buffer is full or the end of the stream is reached.
    while (total < buf.length) {
        int num = in.read(buf, total, buf.length - total);
        if (num < 0) {
            // End of stream: return -1 if nothing was read, otherwise the partial count.
            return total == 0 ? -1 : total;
        }
        total += num;
    }
    return total;
}

If all the records are the same length, create an array of the record length and the above method will read one record at a time.
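For example, a minimal usage sketch (the 140-byte record length comes from your comment, the file name is just a placeholder, Cp037 is assumed as the EBCDIC code page, and the file is assumed to be an exact multiple of the record length; readBuffer() is the method above):

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

void readFixedLengthRecords() throws IOException {
    byte[] record = new byte[140];                    // assumed fixed record length
    Charset ebcdic = Charset.forName("Cp037");        // US EBCDIC; pick the right code page

    try (InputStream in = new BufferedInputStream(
            Files.newInputStream(Paths.get("mainframe.dat")))) {   // placeholder file name
        while (readBuffer(in, record) > 0) {
            // Character fields can be decoded with the EBCDIC charset;
            // COMP-3 fields must be unpacked separately, not charset-decoded.
            String text = new String(record, ebcdic);
            // ... process one record ...
        }
    }
}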

Finally, the JRecord project has classes to read fixed-length files etc. It can do COMP-3 conversion. Note: I am the author of JRecord.

Bruce Martin
  • This method: `DataInputStream.readFully()` will also fill an array of bytes, and it's somewhat better tested. – user207421 Oct 15 '15 at 23:23
  • Yes, this approach will work fine, but I sometimes have records of varying length as well. I would like to read the file in considerable chunks and process it based on a record identifier. If there is some residual data left in the chunk, as records are of variable length, I would like to reset to the previous offset and continue processing again in a new chunk. – BaN3 Oct 16 '15 at 09:15
  • By the sounds of it you have Mainframe-VB files without the RDW (record descriptor word; actually it is just the record length). Dropping the RDW tends to be the default when sending VB files from the mainframe to other platforms. There is normally an option to retain the RDW when sending a VB file to the PC. I think it is safer to retain the RDW. JRecord has routines for reading Mainframe-VB files (either as a byte array or its own Line class) – Bruce Martin Oct 16 '15 at 11:12

I'm running into an out-of-memory exception as the amount of data handled is huge, about 5 GB.

You only need to read one record at a time.

My code is currently as follows:

byte[] data = Files.readAllBytes(path);

This is resulting in an out-of-memory exception, which I can understand

Me too.

but I can't use a file scanner either, since the data in the file won't be split into lines.

You mean you can't use the Scanner class? That's not the only way to read a record at a time.

In any case, not all files have record delimiters. Some have fixed-length records, some have length words at the start of each record, and some have record-type attributes at the start of each record, or at least in the fixed part of the record.

I'll have to split it based on an attribute record_id at a particular position (say at the beginning of each record) that will tell me the record length

So read that attribute, decode it if necessary, and read the rest of the record according to the record length you derive from the attribute. One at a time.

I direct your attention to the methods of DataInputStream, especially readFully(). You will also need a Java COMP-3 library. There are several available. Most of the rest can be done by built-in EBCDIC character set decoders.
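A minimal sketch of that loop, assuming (purely for illustration) a two-byte big-endian binary length prefix before each record, the Cp037 EBCDIC code page, and a made-up field layout; the COMP-3 unpacking shown is the usual packed-decimal layout (two digits per byte, sign in the last nibble), but a tested library is still the better option:

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RecordReader {

    private static final Charset EBCDIC = Charset.forName("Cp037"); // assumed code page

    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(Paths.get(args[0]))))) {
            while (true) {
                int recordLength;
                try {
                    // Assumed: a 2-byte length prefix giving the length of the data that follows.
                    recordLength = in.readUnsignedShort();
                } catch (EOFException eof) {
                    break;                                 // clean end of file
                }
                byte[] record = new byte[recordLength];
                in.readFully(record);                      // read the rest of this record

                // Hypothetical layout: a 20-byte character field followed by a 5-byte COMP-3 field.
                String name = new String(record, 0, 20, EBCDIC);
                BigDecimal amount = unpackComp3(record, 20, 5, 2);
                // ... process the record ...
            }
        }
    }

    // Unpacks an IBM packed-decimal (COMP-3) field: two digits per byte,
    // with the sign held in the low nibble of the last byte (0xD = negative).
    static BigDecimal unpackComp3(byte[] buf, int offset, int len, int scale) {
        StringBuilder digits = new StringBuilder(len * 2);
        for (int i = 0; i < len; i++) {
            int b = buf[offset + i] & 0xFF;
            digits.append(b >> 4);                         // high nibble is always a digit
            if (i < len - 1) {
                digits.append(b & 0x0F);                   // low nibble is a digit...
            } else if ((b & 0x0F) == 0x0D) {
                digits.insert(0, '-');                     // ...except in the last byte, where it is the sign
            }
        }
        return new BigDecimal(new BigInteger(digits.toString()), scale);
    }
}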

user207421
  • I agree with you; currently I'm also taking a similar approach of reading in chunks, identifying the record based on record_id and processing it. I'm using a mapped buffer for it and facing an issue with the offset, as the offset is an int and the file size exceeds that. I will face a similar issue trying to read using DataInputStream as well, I believe. mb = ch.map( FileChannel.MapMode.READ_ONLY, prevZ, bufLength*lineLength ); mb.get( data, prevZ, nGet ); prevZ is the last record position successfully read, so I can reset and fetch a new chunk, but this sometimes exceeds the int range and I end up with negative values – BaN3 Oct 16 '15 at 09:09
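For what it's worth, FileChannel.map() itself takes a long position; the overflow only appears when a file offset is reused as an int index into the mapped buffer or the byte array. A rough sketch of mapping the file in windows and advancing a long file position (the 64 MB window size and the two-byte length prefix are just assumptions for illustration):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedWindowReader {

    private static final int WINDOW_SIZE = 64 * 1024 * 1024;  // each window stays well below Integer.MAX_VALUE

    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            long fileSize = ch.size();
            long filePos = 0;                                  // long, so files over 2 GB are fine

            while (filePos < fileSize) {
                int windowLen = (int) Math.min(WINDOW_SIZE, fileSize - filePos);
                MappedByteBuffer window = ch.map(FileChannel.MapMode.READ_ONLY, filePos, windowLen);

                int lastComplete = 0;                          // buffer offset just after the last whole record
                while (window.remaining() >= 2) {
                    window.mark();
                    int recordLength = window.getShort() & 0xFFFF;  // assumed 2-byte length prefix
                    if (recordLength > window.remaining()) {
                        window.reset();                        // partial record: remap starting here
                        break;
                    }
                    byte[] record = new byte[recordLength];
                    window.get(record);
                    // ... process one record ...
                    lastComplete = window.position();
                }
                if (lastComplete == 0) {
                    throw new IOException("Record does not fit in one window at offset " + filePos);
                }
                filePos += lastComplete;                       // advance by whole records only
            }
        }
    }
}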