3

For large strings (60MB or so long), FileWriter is appending extra nulls to the end of my files. For small strings this code works as expected.

For clarity, dat and filePath are Strings.

FileWriter fstream = new FileWriter( filePath );
fstream.write( dat );
fstream.close();

File f = new File( filePath );         
System.out.println("Data: " + dat.length() + ", File: " + f.length());

In short, under what circumstances should the two printed values differ?

Here's my example output:

Data: 63833144, File: 63833728

I got 584 extra nulls at the end of the file for some reason. I find it reasonable that the string might be over-allocated, but these shouldn't be written to the file, right? To make things worse, if I explicitly give it the length:

fstream.write(dat, 0, dat.length());

The behavior is the same. Coincidentally, if I say (dat.length() - 584), it does what I want, but only in this specific case.

Any ideas?

JDK version: 1.7.0_02

Edited: Added the types of the variables (both Strings)

Chad Mourning
  • `dat` is a `String`, right? Does your `String` contain any special characters? You know you are comparing the length of a `String` in characters with file length in bytes? Not necessarily the same. – Tomasz Nurkiewicz Jan 11 '13 at 21:37
  • @TomaszNurkiewicz I know of no Unicode encoding that would ever append 584 extraneous nulls at the end of an encoded string. – millimoose Jan 11 '13 at 21:42
  • Is `dat` a `char[]` array? – Alexander Pogrebnyak Jan 11 '13 at 21:48
  • dat and filePath are both Strings – Chad Mourning Jan 11 '13 at 22:46
  • @TomaszNurkiewicz dat is a String, and should be the contents of an HTML file in this case. You are correct, `dat.length()` does not match `dat.getBytes().length`. In fact, the latter matches the file size, so are you saying that FileWriter.write() should not be expected to output the contents of the String, but rather its internal representation? I tried converting it to a char array and had the same issue. What is the right way to accomplish this? – Chad Mourning Jan 11 '13 at 22:55
  • @ChadMourning **1)** check the String itself, whether it contains trailing `\0`s, and how many of them. **2)** check the code *reading* the HTML file into a `String`. It's very possible that's where the root cause lies, it's not unlikely you're getting some buffer sizes confused. E.g. you're reading into a `char` buffer array whose length is the filesize in `byte`s (this is a terrible way to read an entire file in a couple of ways). – millimoose Jan 11 '13 at 23:22
  • In that scenario, ignoring the encoding would cause the behaviour you're seeing. That is, because the file contains (byte sequences that the `Reader`'s encoding considers to be) multibyte characters, you allocate a `char` array that's too big. That means you never write to the end of it, and because a char array is initialised with `\0`s, the string you build from it will have trailing nulls. (That said, this is just me guessing a scenario that could lead to your observed problem.) – millimoose Jan 11 '13 at 23:28
  • @millimoose I suppose my question is then, what is the right way to write just the content of the String as opposed to the entire allocated part? – Chad Mourning Jan 12 '13 at 00:29
  • @ChadMourning It's the wrong question, because if I'm right your problem is the reading not the writing - you should fix that instead of devising a workaround that comes into play much later. Your question should be "here's the code reading the HTML, why does it give me a bunch of trailing nulls and what to do about it?" (Assuming said trailing nulls aren't in the HTML file, and that my guess is correct. Have you checked the file reading code then?) – millimoose Jan 12 '13 at 00:53
  • It appears that the String contains some 2-byte characters (mostly degree symbols), exactly 584 of them. When they get written back out, for whatever reason, they are written as 1 byte each and the excess room is padded with nulls. I suppose this is an encoding issue then? Still trying to figure out what I have to do to fix that. – Chad Mourning Jan 12 '13 at 01:38
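The scenario guessed at in the comments above can be reproduced in a few lines. This is only a sketch under assumptions: the original reading code was never posted, so `readBuggy` is a hypothetical stand-in for it, the input is assumed to be UTF-8, and an in-memory byte array plays the role of the HTML file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class TrailingNulDemo {

    // Hypothetical buggy reader (not the OP's actual code): it sizes the char
    // buffer by the number of BYTES in the input instead of the number of chars.
    static String readBuggy(byte[] fileBytes) throws IOException {
        char[] buf = new char[fileBytes.length]; // too big if multi-byte characters are present
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8)) {
            int off = 0;
            int n;
            while (off < buf.length && (n = r.read(buf, off, buf.length - off)) != -1) {
                off += n;
            }
        }
        // The unused tail of buf still holds '\0', and it becomes part of the String.
        return new String(buf);
    }

    // Counts the '\0' characters at the end of s.
    static int countTrailingNuls(String s) {
        int i = s.length();
        while (i > 0 && s.charAt(i - 1) == '\0') {
            i--;
        }
        return s.length() - i;
    }

    public static void main(String[] args) throws IOException {
        String original = "20\u00B0C and 30\u00B0C"; // two degree signs, 2 bytes each in UTF-8
        byte[] fileBytes = original.getBytes(StandardCharsets.UTF_8);
        String dat = readBuggy(fileBytes);

        System.out.println("chars in original text: " + original.length());
        System.out.println("bytes in the file:      " + fileBytes.length);
        System.out.println("dat.length():           " + dat.length());
        System.out.println("trailing NULs in dat:   " + countTrailingNuls(dat));
        System.out.println("bytes when re-encoded:  " + dat.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

With this input the text has 13 characters but 15 bytes, so `dat` ends up 15 chars long with 2 trailing NULs, and re-encoding it gives 17 bytes: the same "file is longer than the String by the number of multi-byte characters" pattern as in the question.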

4 Answers

2

What is "dat"? If "dat" is a StringBuffer, you need to be careful. If the length of the StringBuffer is greater than its contents, then nulls will be appended to the end. You might try to use dat.toString(). The null characters will be trimmed in the conversion, I believe.

mightyrick
  • Do you have any references (like a bug report) for this behavior? Because `StringBuffer` (or `StringBuilder`) should behave like any other `CharSequence` in this case. – parsifal Jan 11 '13 at 22:20
  • Check the setLength() method of the Javadoc. It does offer some indication of how StringBuffer behaves behind the scenes regarding length and nulls. http://docs.oracle.com/javase/6/docs/api/java/lang/StringBuffer.html#setLength(int) – mightyrick Jan 11 '13 at 22:24
  • It is a String, but I did notice that .getBytes().length of that string was 584 longer than .length(), and therefore matched the values I outputted above. I don't know what would be occurring exactly 584 times in 60MB files though. – Chad Mourning Jan 11 '13 at 22:48
  • It literally could be "null" characters, i.e. `(char) 0`. Null characters and terminators are completely valid within the context of a string or file. In C, null terminators are frequently used. I'm not sure how the strings are being read in from the files, but I think trimming and/or detecting nulls (and ignoring them) is probably important here. – mightyrick Jan 11 '13 at 22:52
  • I suppose my question is, then, what is the right way to output just the string content of the String? – Chad Mourning Jan 11 '13 at 22:57
  • If I had to remove nulls from a string, then I would do something like this: `myString = myString.replace("\0", "");` Then you can print it out without fear of null characters being contained within it. – mightyrick Jan 11 '13 at 23:02
  • @RickGrashel The question is, why would the length of the `StringBuffer` ever be explicitly set to anything? I've honestly never used that method once. (Do note that the length is different from the capacity.) – millimoose Jan 11 '13 at 23:31
  • StringBuffer specifically has a default backing buffer size. But OP clarified that he isn't using a StringBuffer, he's using a String. So if null characters are getting into his String, then I really think that his file is padded with nulls at the end (which is actually very common for some file types). – mightyrick Jan 12 '13 at 03:37
1

I suggest that you never use FileWriter, because it uses your platform's default encoding to convert the String to a byte stream.

Instead you should do this:

Writer writer =
  new OutputStreamWriter( 
    new FileOutputStream( fileName ),
    // Always specify encoding compatible with your string
    "UTF-8"
  );

try
{
  writer.write( dat );
  writer.flush( );
}
finally
{
  writer.close( );
}

Also, the String length (in chars) and the length of the resulting byte stream don't have to match. With UTF-8 they match only when the string is pure ASCII.
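To see the effect on an actual file, here is a minimal sketch that writes a String with an explicit charset and returns the resulting file size. The class and method names (`ExplicitCharsetWrite`, `writeUtf8`) are made up for illustration, and UTF-8 is an assumption; use whatever encoding your data really needs. It uses try-with-resources, which is available from Java 7 on.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetWrite {

    // Writes dat to file as UTF-8 and returns the file size in bytes.
    static long writeUtf8(File file, String dat) throws IOException {
        // try-with-resources flushes and closes the writer, even on failure
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file), StandardCharsets.UTF_8)) {
            writer.write(dat);
        }
        return file.length();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("writer-demo", ".txt");
        tmp.deleteOnExit();

        String dat = "20\u00B0C"; // 4 chars; the degree sign takes 2 bytes in UTF-8
        long fileLength = writeUtf8(tmp, dat);

        System.out.println("Data: " + dat.length() + ", File: " + fileLength);
        System.out.println("Encoded length: " + dat.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

The file size should always equal `dat.getBytes(StandardCharsets.UTF_8).length`, and it equals `dat.length()` only when every character is ASCII.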

Alexander Pogrebnyak
  • Good general advice on the encoding, but it doesn't explain the behaviour the OP is seeing. – millimoose Jan 11 '13 at 21:43
  • @millimoose. It's hard to go on the very limited amount of information that OP has provided. Is `dat` a `String` or `char[]`? Is `dat` ASCII or non-ASCII? etc. – Alexander Pogrebnyak Jan 11 '13 at 21:47
  • @AlexanderPogrebnyak Nonetheless, the information he does provide seems to rule out the length discrepancy being caused by this encoding-related bug. The thing to do when given insufficient information is to request more information, not give a "wrong" (i.e. for the question) answer. – millimoose Jan 11 '13 at 23:20
1

The file length depends on encoding. This test

System.out.println(dat.getBytes().length);

will show the length in bytes after encoding, because String.getBytes() uses the same (default) encoding as new FileWriter(file).
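As a small illustration of how much the byte count depends on the encoding, the sketch below encodes the same text with two charsets. The helper name `encodedLength` is hypothetical; the charsets are passed explicitly so the result does not depend on the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsVsBytes {

    // Number of bytes s occupies when encoded with the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String dat = "20\u00B0C"; // 4 chars, one of them a degree sign

        System.out.println("chars:      " + dat.length());
        System.out.println("UTF-8:      " + encodedLength(dat, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + encodedLength(dat, StandardCharsets.ISO_8859_1));
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + encodedLength(dat, Charset.defaultCharset()));
    }
}
```

The degree sign takes two bytes in UTF-8 but one in ISO-8859-1, so the same 4-char String is 5 bytes in the first case and 4 in the second.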

Evgeniy Dorofeev
  • Yes, it turns out 584 of the characters in the string were multi-byte characters, and that accounted for the difference in size. The actual error in the output is because the way they were read in shoved all the multi-byte characters into single bytes, leaving extra \0s at the end. – Chad Mourning Jan 13 '13 at 20:50
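Since the root cause was on the reading side, one way to avoid the padding is to decode the file's bytes in a single step rather than filling a pre-sized `char` array. This is a sketch, not the OP's actual fix (which was not posted): `readAll` is a hypothetical helper, and it assumes the file is UTF-8 and fits in memory. `Files.readAllBytes` exists from Java 7 on.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadWholeFile {

    // Reads the whole file and decodes it with the given charset.
    // The resulting String has exactly as many chars as the decoder produces,
    // so there is no oversized buffer and no '\0' padding.
    static String readAll(Path path, Charset charset) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        return new String(bytes, charset);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("read-demo", ".html");
        try {
            String original = "<p>20\u00B0C</p>";
            Files.write(tmp, original.getBytes(StandardCharsets.UTF_8));

            String dat = readAll(tmp, StandardCharsets.UTF_8);
            System.out.println("round trip equal: " + original.equals(dat));
            System.out.println("ends with NUL:    " + dat.endsWith("\0"));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Reading and writing with the same explicit charset keeps the round trip byte-for-byte stable, as long as the file really is in that encoding.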
0

I ran a test with a string of 63833144 'A's, and the output was: Data: 63833144, File: 63833144

So I'm sure this is an encoding problem.

(I would have posted this as a comment, but I don't have 50 rep yet, so I can't. :/)

maxammann
  • I agree it is probably an encoding problem, I guess I'm just not sure how to fix it. I even tried making a UTF-8 writer as recommended below and had the same issue. – Chad Mourning Jan 11 '13 at 22:58
  • @ChadMourning Hm this does not explain the problem if this works but `PrintStream print = new PrintStream(new FileOutputStream("test.txt")); print.print(s);` – maxammann Jan 11 '13 at 23:08