My Java program is trying to read a text file (a mainframe VSAM file converted to a flat file). I believe this means the file is encoded in EBCDIC.

I open the file with com.ibm.jzos.FileFactory.newBufferedReader(fullyQualifiedFileName, ZFile.DEFAULT_EBCDIC_CODE_PAGE);

I then use String inputLine = inputFileReader.readLine() to read a line and store it in a Java String variable for processing. I have read that text stored in a String variable becomes Unicode.

How can I ensure that the content is not corrupted when storing in the java string variable?

Mark Rotteveel
yathirigan
    If you choose the correct encoding on that BufferedReader, nothing will be corrupted. The conversion to Unicode (which has to happen for Java Strings) is lossless. – Thilo Aug 25 '17 at 12:28

2 Answers

The charset decoder maps the bytes to the correct Unicode characters for the String, and the encoder does the reverse when you write.

The only problem is that BufferedReader.readLine drops the line endings (for EBCDIC data this may be the end-of-line NEL character, \u0085, which is also a recognized Unicode newline). So when writing, either write the NEL yourself or set the system line separator property.
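As a rough illustration of the point about line endings, the sketch below reads lines with readLine and puts an explicit separator back when reassembling the text. The class name LineEndingDemo, the helper rejoin, and the sample input are all hypothetical; a plain StringReader stands in for the real file reader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class LineEndingDemo {
    // readLine() returns each line WITHOUT its terminator, so the caller
    // has to append whichever separator the output is supposed to use.
    static String rejoin(BufferedReader reader, String separator) {
        StringBuilder out = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append(separator);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample input; a real program would wrap the file reader.
        BufferedReader reader = new BufferedReader(new StringReader("first\nsecond\n"));
        String nel = String.valueOf((char) 0x0085); // NEL, written back explicitly
        String rejoined = rejoin(reader, nel);
        System.out.println("Characters after rejoining: " + rejoined.length());
    }
}
```

Passing the separator as a parameter keeps the choice of line ending visible at the call site instead of relying on a platform default.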

Nothing is easier than writing a unit test that takes all 256 EBCDIC byte values and converts them back and forth.
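A minimal sketch of such a round-trip check might look like the following. The class and method names are made up for illustration, and the charset name "IBM037" is only an assumption; substitute the code page the file was actually written with. The method reports whichever byte values fail instead of assuming that all 256 pass.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

final class EbcdicRoundTrip {
    // Returns every byte value (0..255) that does NOT come back unchanged
    // after decoding to a String and encoding again with the same charset.
    static List<Integer> nonRoundTripping(Charset charset) {
        List<Integer> failures = new ArrayList<>();
        for (int value = 0; value < 256; value++) {
            byte[] original = { (byte) value };
            String decoded = new String(original, charset);
            byte[] encoded = decoded.getBytes(charset);
            if (encoded.length != 1 || encoded[0] != original[0]) {
                failures.add(value);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Assumption: "IBM037" stands in for the code page the file really uses.
        Charset ebcdic = Charset.forName("IBM037");
        System.out.println("Bytes that do not round-trip: " + nonRoundTripping(ebcdic));
    }
}
```

If the list is not empty, those byte values are the ones to inspect by hand before trusting a decode-then-encode cycle on real data.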

Joop Eggen

If you have read the file with the correct character set (which is the biggest assumption here), then it doesn't matter that Java itself uses Unicode internally: Unicode contains all the characters of EBCDIC.

A character set specifies the mapping between a character (code point) and one or more bytes. A file is nothing more than a stream of bytes; if you apply the right character set, the right characters end up in memory.

Say the character 'A' is encoded as the single byte 1 in character set X and as the bytes 0 and 65 in UTF-16. Reading a file that contains byte 1 using character set X will then make the system read the character 'A', even though that system uses bytes 0 and 65 to store that character in memory.
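The same idea can be shown with a real code page instead of the hypothetical character set X. In the sketch below, the class name and the choice of "IBM037" are assumptions for illustration; in that code page the byte 0xC1 represents 'A'.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CharsetMappingDemo {
    // Decodes raw bytes into a String using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Assumption: IBM037 plays the role of "character set X" here.
        Charset ebcdic = Charset.forName("IBM037");

        byte[] onDisk = { (byte) 0xC1 };          // one EBCDIC byte as stored in the file
        String inMemory = decode(onDisk, ebcdic); // the decoder maps it to 'A'
        System.out.println(inMemory);             // prints A

        // The same character expressed as big-endian UTF-16 bytes.
        byte[] utf16 = inMemory.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(Arrays.toString(utf16)); // prints [0, 65]
    }
}
```

The byte on disk and the bytes in memory differ, but both denote the same character, which is why the conversion does not corrupt anything as long as the decoding charset is correct.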

However, there is no way to know whether you used the right character set unless you specifically know what the actual result should be.
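To illustrate why the expected result is needed, the sketch below decodes the same five bytes with two different charsets. Both calls succeed without any error, yet only one produces the intended text. The class name, the sample bytes, and the use of "IBM037" are assumptions chosen for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WrongCharsetDemo {
    // Hypothetical sample: the word HELLO as encoded in code page IBM037.
    static final byte[] SAMPLE = {
        (byte) 0xC8, (byte) 0xC5, (byte) 0xD3, (byte) 0xD3, (byte) 0xD6
    };

    // Decodes the sample bytes with whatever charset the caller guesses.
    static String decode(Charset charset) {
        return new String(SAMPLE, charset);
    }

    public static void main(String[] args) {
        String right = decode(Charset.forName("IBM037"));
        String wrong = decode(StandardCharsets.ISO_8859_1);

        // Neither call throws; only comparing against the known expected
        // text reveals which charset was the right one.
        System.out.println("IBM037 matches expected text:     " + right.equals("HELLO"));
        System.out.println("ISO-8859-1 matches expected text: " + wrong.equals("HELLO"));
    }
}
```

In practice, this means checking a few records whose content is known in advance, for example a field that always holds a fixed literal.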

Mark Rotteveel