
I am trying to copy files with some mandatory parameters: inputEncoding, outputEncoding and outputLineSeparator.

But when I run the code below, the file that ends with a CRLF is not copied correctly: the final CRLF disappears.

I think readLine returns null after line 3 because line 4 is empty...


My goal is a copyFile function that strictly copies the two files below.
Is there a way to also copy this final empty line (the last newline char)?


Thanks in advance for any help.


Input Files
File testInEndNL.txt(explicit char)

A<CRLF>
B<CRLF>
C<CRLF>

File testInEndEOF.txt(explicit char)

A<CRLF>
B<CRLF>
C


Output Files
File testOutEndNL.txt(explicit char) KO for me

A<LF>
B<LF>
C

File testOutEndEOF.txt(explicit char) OK for me

A<LF>
B<LF>
C


Code

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;


public class TestEncoding {

    public static void main(String[] args) {
        File src;
        File dst;
        Charset inputEncoding;
        Charset outputEncoding;
        String outputLineSeparator;

        inputEncoding = Charset.defaultCharset();
        outputEncoding = Charset.forName("UTF-16");
        outputLineSeparator = "\n";

        src = new File("C:\\Users\\Dam\\Desktop\\testFiles\\testInEndNL.txt");
        dst = new File("C:\\Users\\Dam\\Desktop\\testFiles\\testOutEndNL.txt");
        copyFile(src, dst, inputEncoding, outputEncoding, outputLineSeparator);

        src = new File("C:\\Users\\Dam\\Desktop\\testFiles\\testInEndEOF.txt");
        dst = new File("C:\\Users\\Dam\\Desktop\\testFiles\\testOutEndEOF.txt");
        copyFile(src, dst, inputEncoding, outputEncoding, outputLineSeparator);

    }

    private static void copyFile(File src, File dst, Charset inputEncoding, Charset outputEncoding, String outputLineSeparator) {
        try {
            String oldLineBreak = System.setProperty("line.separator", outputLineSeparator);
            BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(src), inputEncoding));
            BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(dst), outputEncoding));
            String line = reader.readLine();
            if(line != null) writer.write(line);
            while ((line = reader.readLine()) != null) {
                writer.newLine();
                writer.write(line);
            }
            reader.close();
            writer.close();
            System.setProperty("line.separator", oldLineBreak);
        }
        catch(Exception e) {
            e.printStackTrace();
        }
    }
}
Damien

1 Answer


The short answer is that you can't do it with .readLine(), because it strips off the end-of-line characters.

You will need to use .read() instead, which reads one character at a time, and do your own EOL processing. It returns an int (-1 at end of stream), which you can cast to a char once you have checked that it isn't -1:

int r = reader.read();
char c = (char) r; // only meaningful when r != -1

A CR/LF will come out as two separate characters, so you'll have to watch for that and process accordingly. If you know that your files will only have either CR/LF or just LF, then it's a little easier, because whenever you read a CR you know an LF is following straight behind.

The values you're reading won't vary. The reader decodes the input into Java chars, and in those CR and LF are always single characters. In int terms, they'll come out as 13 and 10 respectively.
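
As a rough illustration of the approach described above, here is a minimal sketch that reads with .read() and replaces every CR, LF, or CRLF with the requested separator, so a final line break in the input survives in the output. The class name EolPreservingCopy and the helper copy(...) are made up for this example. The copyFile signature mirrors the one in the question, except that it throws IOException instead of catching it, and it assumes the input uses only CR, LF, or CRLF line endings.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical class name, used only for this sketch.
class EolPreservingCopy {

    // Copies characters from in to out, replacing each CR, LF, or CRLF
    // with outputLineSeparator. A trailing line break is kept.
    static void copy(Reader in, Writer out, String outputLineSeparator) throws IOException {
        boolean previousWasCr = false;
        int r;
        while ((r = in.read()) != -1) { // -1 means end of stream
            char c = (char) r;
            if (c == '\r') {
                out.write(outputLineSeparator);
                previousWasCr = true;
            } else if (c == '\n') {
                // The LF of a CRLF pair was already handled when the CR was seen.
                if (!previousWasCr) {
                    out.write(outputLineSeparator);
                }
                previousWasCr = false;
            } else {
                out.write(c);
                previousWasCr = false;
            }
        }
    }

    // Same parameters as copyFile in the question; the Writer takes care of
    // encoding the separator in outputEncoding.
    static void copyFile(File src, File dst, Charset inputEncoding,
                         Charset outputEncoding, String outputLineSeparator) throws IOException {
        try (Reader reader = new BufferedReader(
                 new InputStreamReader(new FileInputStream(src), inputEncoding));
             Writer writer = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(dst), outputEncoding))) {
            copy(reader, writer, outputLineSeparator);
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        copy(new StringReader("A\r\nB\r\nC\r\n"), out, "\n");
        System.out.println(out.toString().equals("A\nB\nC\n")); // prints true
    }
}
```

Writing the separator as a plain string avoids touching the line.separator system property, and try-with-resources closes both streams even if the copy fails partway.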

chiastic-security
  • Oh, so bad... Thanks for the answer. And if I use read, how can I detect the various line endings (LF, CRLF, CR) across different encodings? read returns an int, and "CRLF" for example is 2 ints (bytes) long, and the bytes may differ with some strange encodings... – Damien Nov 06 '14 at 20:45
  • Oh OK, so we are sure that read will only get one byte (8 bits), right?
    I'd like to do what readLine does; do we only have LF, CRLF and CR? The other problem is that inputEncoding may vary, so the encoded bytes are not always the same. Sorry for the brain torture :/
    – Damien Nov 06 '14 at 21:06
  • you'll actually read as [UTF-16](http://stackoverflow.com/a/14224007) but that doesn't change a thing – zapl Nov 06 '14 at 21:06
  • @zapl Yes, it's just for the example. Imagine arbitrary charsets for input and output – Damien Nov 06 '14 at 21:08
  • @Damien It doesn't matter. Each `.read()` will give you a single character, no matter how it's encoded. And you don't care how most of them are encoded, because if they're not EOL characters, then you just copy them to the output stream. A CR or LF will always come out as a single character, either 13 or 10. – chiastic-security Nov 06 '14 at 21:12
  • I ran a test: `for(Charset charset : Charset.availableCharsets().values()) System.out.println(charset.encode("\n").array()[0]);` and I can see different values... So there is no way to abstract it? – Damien Nov 06 '14 at 21:20
  • @Damien Let me rephrase... whatever encoding you use, a `.read()` will give a single character. And in the encoding you'll get from a `.read()` in its default state, you'll get either a 13 or a 10 for a CR or LF. But if you're specifying a particular encoding, then yes, you might end up with things coming out differently, and you'll need to use the encoding that was used to create the file. – chiastic-security Nov 06 '14 at 21:25
  • @Damien Think about it like this. If the file contains the hex bytes `48 0A`, then in UTF-8 that would be an `H` followed by a LF. But you could create your own encoding in which this would be a single character if you felt like it. The point is that whether an EOL is present in the file is dependent on the encoding, and if you read the file with the wrong encoding, you'll get wrong results. Not much you can do about that. – chiastic-security Nov 06 '14 at 21:27
  • Yes, I understand. So if I watch for a read result of `13`, `10`, or `13 10`, I can replace all newlines. Good! But I can't change my output encoding in this case, right? – Damien Nov 06 '14 at 21:35
  • @Damien look at https://gist.github.com/anonymous/a212d37f7e1eeed61eb5 - the output encoding of `'\n'` is handled by the `Writer`. – zapl Nov 06 '14 at 21:42
  • @Damien You can change your output encoding, yes. Write a newline, and the output writer will do the right thing with it. – chiastic-security Nov 06 '14 at 21:43
  • @chiastic-security Nice, I tried it and it works fine for the end of line. But I'm not sure about the encoding... if I add `BZażółć gęślą jaźń` to my input file, convert it to UTF-8 without BOM, and change outputEncoding to UTF-8, the result file gets: `Zażółć gęślÄ… jaźń` – Damien Nov 06 '14 at 22:23
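
To make the encoding points from the comments concrete, the following sketch (the class and method names are invented for illustration) encodes the same text with a few standard charsets and reads it back through a Reader. The byte counts differ, but the values returned by read() for CR and LF stay 13 and 10 as long as the reader uses the charset the bytes were written with. The second helper decodes UTF-8 bytes with ISO-8859-1, which is one possible explanation for garbled output like that in the last comment: the input file was converted to UTF-8 while inputEncoding was still a different charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name, used only for this sketch.
class EolEncodingDemo {

    // Encodes text with the given charset, decodes it again through a Reader,
    // and returns the int values produced by read().
    static List<Integer> readValues(String text, Charset charset) throws IOException {
        byte[] bytes = text.getBytes(charset);
        List<Integer> values = new ArrayList<>();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int r;
            while ((r = reader.read()) != -1) {
                values.add(r);
            }
        }
        return values;
    }

    // Decodes bytes written in one charset as if they were in another.
    static String decodeAs(String text, Charset writtenWith, Charset readWith) {
        return new String(text.getBytes(writtenWith), readWith);
    }

    public static void main(String[] args) throws IOException {
        Charset[] charsets = {
            StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8, StandardCharsets.UTF_16
        };
        for (Charset cs : charsets) {
            // Byte length depends on the charset; the decoded values do not.
            System.out.println(cs + " bytes=" + "A\r\nB".getBytes(cs).length
                + " chars=" + readValues("A\r\nB", cs));
        }
        String a = "\u0105"; // the letter a with ogonek, written as an escape
        System.out.println(decodeAs(a, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(a));
        System.out.println(decodeAs(a, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).equals(a));
    }
}
```

If a mismatch of this kind is the cause, passing the file's actual charset as inputEncoding (rather than Charset.defaultCharset()) should fix the garbled characters.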