I have a problem in Java: I have a file that I believe is ASCII-encoded, and when I write its contents to the output file, special characters that I need to keep are changed:

Original file: [screenshot]

Output file: [screenshot]

This is the code I use to read the ASCII file into a string. Each record has a length of 7000, and the problem is in positions 486 to 498 of that record, which contain the special characters: FileReader does not return them correctly and replaces them with others instead of keeping them (as I understand it, that part is binary):

            fr = new FileReader(sourceFile);
            //BufferedReader br = new BufferedReader(fr);
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(new FileInputStream(sourceFile), "UTF-8"));

            String asciiString;
            asciiString = br.readLine();

Edit:

I am doing a conversion from ASCII to EBCDIC. I am using CharFormatConverter.java

I really don't understand why the special characters are lost instead of being preserved. I found the UTF-8 code in another forum (Read file utf-8), but characters are still lost.

Edit:

I was thinking about using FileReader for the ASCII data and FileInputStream for the binary data in the same file (though I can't figure out how to extract it by position), so that the two formats are kept separate and can be merged again after the conversion.

Regards.

  • 5
    The characters in your 'original file' aren't ascii, so when you say "_I have a problem with JAVA because I have a file with ASCII", no you don't. That makes the rest of the question confusing. Perhaps read up on what ASCII is, and then re-consider what your input file actually contains (because it's not ASCII). Perhaps ISO-8859-1, or CP-1252, or CP852...? – rzwitserloot Sep 28 '22 at 04:27
  • 2
    If it's ASCII, why would you read it as UTF-8? – shmosel Sep 28 '22 at 04:28
  • Hi @rzwitserloot, I'm making an ASCII to EBCDIC converter, based on CharFormatConverter [CharFormatConverter](https://gist.github.com/joseporiol/8541410) from github, but the problem I'm having is that just bringing the ASCII value from the file changes characters that should be kept – Edisson Gabriel López Sep 28 '22 at 04:41
  • Hi, @shmosel. I wanted to see whether this would stop the characters from changing because, as I understand it, that part of the file is binary, and I read about using UTF-8 or UTF-16. With UTF-8 it changes several characters; it works a little better with BufferedReader br = new BufferedReader(fr); but it still changes characters. – Edisson Gabriel López Sep 28 '22 at 04:44
  • @rzwitserloot Another thing I do after converting to EBCDIC is use CP1047 to produce the output file. The problem is FileReader (on the ASCII file), because it changes the characters; that is the only problem I have with the conversion, and the file is rejected in testing. The way that produces a valid file is to use a program called HxD – Edisson Gabriel López Sep 28 '22 at 04:48
  • 4
    Those characters are not part of the ASCII character set. If you want to treat the data as binary rather than as encoded characters, use the `FileInputStream` without wrapping it in a `Reader` and read the raw bytes. – Tim Moore Sep 28 '22 at 04:55
  • Hi @Tim Moore, I was looking into FileInputStream and I'm a little confused about the length of the frame: is the length preserved? The binary part runs from position 486 to 498, and that is the binary I'm having trouble with. – Edisson Gabriel López Sep 28 '22 at 05:06
  • 1
    What is “length of the frame” supposed to mean? The position of the binary doesn’t matter. Either, your file is a text file or it is not. You can’t do a text conversion for something that isn’t text. What should “non text data encoded as EBCDIC” be? – Holger Sep 28 '22 at 06:37
  • @Holger In the ASCII file, a transaction is a record of length 7000, and within that record the binary part always starts at position 486. That binary part is the only place where I have the problem – Edisson Gabriel López Sep 28 '22 at 12:40
  • 1
    Can I recommend https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ – Brian Agnew Sep 28 '22 at 13:45
  • What encoding / code page is your input file actually? You have to specify the correct Charset for your input file to enable Java to read the file correctly. – cyberbrain Sep 28 '22 at 17:13

2 Answers

0

Since your base code (CharFormatConverter) is byte-oriented, and your input files appear to be binary, you should replace the Readers with InputStreams, which produce bytes (not characters).

This is the usual way to read and process an InputStream:

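import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

private void convertFileToEbcdic(File sourceFile)
throws IOException
{
    try (InputStream input = new FileInputStream(sourceFile))
    {
        byte[] buffer = new byte[4096];
        int len;
        // read() returns the number of bytes read, or -1 at end of stream
        while ((len = input.read(buffer)) != -1)
        {
            byte[] ebcdic = convertBufferFromAsciiToEbcdic(buffer, len);
            // Now ebcdic contains the first len bytes of the buffer converted to EBCDIC. You may use it.
        }
    }
}

private byte[] convertBufferFromAsciiToEbcdic(byte[] ascii, int length)
{
    // Create an array of the same length as the input received
    byte[] ebcdic = new byte[length];
    // TODO: fill it with the input data converted to EBCDIC
    // (placeholder: the actual conversion is not shown here)
    return ebcdic;
}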
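
As a rough sketch of how the conversion stub could be filled in for a record that mixes text and binary data: the text parts are converted through a charset pair, and the binary range is copied byte for byte. The class and method names (MixedRecordConverter, convert, convertText) are hypothetical. The source charset (ISO-8859-1) is only an assumption, since the real encoding of the input file has not been established; IBM1047 is used as the target because CP1047 is mentioned in the comments. Offsets are 0-based, with the end exclusive.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MixedRecordConverter {
    // Assumption: the text bytes are ISO-8859-1; replace with the file's real encoding.
    private static final Charset SOURCE = StandardCharsets.ISO_8859_1;
    // EBCDIC code page 1047 (shipped with standard JDK builds as an extended charset).
    private static final Charset TARGET = Charset.forName("IBM1047");

    /**
     * Converts the text parts of a record to EBCDIC and copies the bytes in
     * [binaryStart, binaryEnd) unchanged. Offsets are 0-based.
     */
    static byte[] convert(byte[] record, int binaryStart, int binaryEnd) {
        byte[] out = new byte[record.length];
        convertText(record, 0, binaryStart, out);
        // The binary range is never decoded, so it cannot be altered.
        System.arraycopy(record, binaryStart, out, binaryStart, binaryEnd - binaryStart);
        convertText(record, binaryEnd, record.length, out);
        return out;
    }

    // Both charsets are single-byte, so the converted text has the same length.
    private static void convertText(byte[] in, int from, int to, byte[] out) {
        byte[] converted = new String(in, from, to - from, SOURCE).getBytes(TARGET);
        System.arraycopy(converted, 0, out, from, converted.length);
    }
}
```

If positions 486 to 498 in the question are 1-based and inclusive, the call would be something like convert(record, 485, 498), but that reading of the layout is a guess. Note that a fixed-length record also has to be read in full before it is converted, which a single read(buffer) call does not guarantee.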
Little Santi
  • 2
    How does that help if the `convertBufferFromAsciiToEbcdic` method still assumes the content to be ASCII? The source can only be either, ASCII or Binary. In the latter case, the entire task of doing a character conversion makes no sense. – Holger Sep 28 '22 at 06:32
  • 1
    Might be better to drop the term 'ascii' as it is only useful if it happens to be US-ASCII, which is an actual encoding known to Java. @Holger is correct, in that there's no getting around the fact that, even if raw bytes are read, if the original encoding is not known precisely, nothing correct can be done as a conversion. 'ascii' encodings differ in their upper regions. In the unlikely event it *is* US-ASCII then you *can* convert OK – g00se Sep 28 '22 at 08:19
  • It didn't work for me, information is lost and characters change. I don't know if I'm missing something. – Edisson Gabriel López Sep 28 '22 at 13:21
  • 1
    @EdissonGabrielLópez what you’re missing, is what has been told to you several times. Your file is not ASCII. The characters `¿¢þù€` do not exist in ASCII. When you see them in a file viewer, the file viewer obviously is already presuming an encoding other than ASCII. To read a file under the same assumption, you would have to use exactly the same encoding as the file viewer. However, you still can’t convert these characters to EBCDIC, as most of them do not exist in EBCDIC either. Not to speak of binary data in general. I suppose, the dots stand for non-character byte values rather than `'.'` – Holger Sep 29 '22 at 06:50
  • @Holger It helps because this *is* the proper way to read bytes, so it will not introduce additional unintentional software errors when reading the input data (as in the OP's original program, which used Readers with some suspicious character conversions). Building on this program, it will be easier to prepare a unit test that shows the expected result for a "model" input file. Making sure the input file is OK is the OP's duty. – Little Santi Sep 29 '22 at 12:52
0

If the data in your file is binary rather than textual, you cannot read it as a String, and no charset will help you. A charset is a scheme that defines how characters map to numeric codes and vice versa, so it only applies to text. You need to read the data as binary (a sequence of bytes) and write it the same way, using an InputStream implementation that reads raw bytes. In your case a good candidate is FileInputStream, though other options exist.
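
To illustrate the point, here is a minimal sketch (the class and method names are hypothetical) comparing a plain byte copy with a round trip through a String decoded as UTF-8. Byte values such as 0xFE and 0xF9 can never occur in valid UTF-8, so the decoder substitutes a replacement character and the original values are lost.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ByteRoundTripDemo {
    /** Decodes the bytes as UTF-8 text and encodes the text again. */
    static byte[] throughUtf8String(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'A' followed by three bytes that are not text in any meaningful sense.
        byte[] raw = {0x41, (byte) 0xFE, (byte) 0xF9, 0x00};

        byte[] viaString = throughUtf8String(raw);
        byte[] copied = Arrays.copyOf(raw, raw.length);

        System.out.println("through String unchanged: " + Arrays.equals(raw, viaString));
        System.out.println("byte copy unchanged:      " + Arrays.equals(raw, copied));
    }
}
```

The same applies when the bytes come from a file: reading them through a Reader decodes them, while reading them through an InputStream leaves them as they are.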

Michael Gantman