BufferedReader messed up by different line separators

Question

I have a BufferedReader streaming a file. There are two cases right now:

It is streaming a file generated on one PC; let's call it File1.
It is streaming a file generated on another computer; let's call it File2.

I'm assuming my problem is caused by the EOLs.

BufferedReader does read both files, but for File2 it reads an extra empty line after every line.

Also, when I compare a line using line.equalsIgnoreCase("abc"), it does not return true even though the line is "abc".

Use this code together with the two files provided in the two links to replicate the problem:

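import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class JavaApplication {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException {
        File file = new File("C:/Users/User/Downloads/html (2).htm");
        // Decode the file as UTF-8 and print it line by line.
        BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"));
        String line = "";

        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
    }
}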

File1, File2

Note how the second file prints an empty line after each line...

I've been searching and trying and searching and trying, and couldn't come up with a solution.

Any ideas how to fix that? (Especially the compare thing?)

I think there is a mix-up between `\r`, CR, 0x0d and `\n`, LF, 0x0a. Because if lines were ending on `\n\r` instead, BufferedReader would recognize two lines: ending in `\n` = Unix style, and `\r` = old Mac style, whereas `\r\n` would have been Windows style. — Joop Eggen, Apr 12 '16 at 19:21
Well, it would make more sense if `\r\n` would make it read an empty line every other line, but it's `\r` that's doing that... I used this answer to find out the separator, maybe it is not fully functional either ^^ http://stackoverflow.com/a/13828045/3653975 — Maverick283, Apr 12 '16 at 19:26
@JoopEggen yet what is the problem? Updated answer to show the code I use to get the new line, and also to compare the lines... Help is much appreciated — Maverick283, Apr 12 '16 at 19:27
@SaviourSelf that wouldn't help anything as the line comes from the bufferedReader, so I cannot replace anything prior to reading it line by line ;) — Maverick283, Apr 12 '16 at 19:30
@Maverick283 Are the file contents too large to hold in memory? Can you read the entire file contents into a `StringBuilder()` ? — Clark Kent, Apr 12 '16 at 19:32
@SaviourSelf both files are almost equal in size... Since one works and the smaller one doesn't this shouldn't affect the result either... — Maverick283, Apr 12 '16 at 19:36
The first shows UTF-8, the other UTF-16. ASCII text stored as UTF-16 but erroneously read as UTF-8 contains a NUL byte (char) after every byte (char). **That is the solution.** It is also specified as such in the HTML, charset=... – Joop Eggen, Apr 12 '16 at 20:56
@JoopEggen But both files have `charset=UTF-16`, don't they? Yet that would explain why a Levenshtein returned big values when I tried to compare the two... — Maverick283, Apr 12 '16 at 21:00
Yes, but the sizes are very different, and you are reading them as UTF-8. One should read them probably as UTF-16LE, My program's editor Kate detected UTF-8 for the small, UTF-16LE for the larger one. A small test would be to do `line = line.replace("\u0000", "");` — Joop Eggen, Apr 12 '16 at 21:04
Ohhh I see where you're coming from... Yeah, that UTF-8 thing happened when I tried to get them uploaded... The originals are both the same encoding though... — Maverick283, Apr 12 '16 at 21:04
I must go to bed ;) - maybe do a hex dump or such. Good luck — Joop Eggen, Apr 12 '16 at 21:05
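
The NUL-character explanation from the comments above can be reproduced without the original files. The following is only a minimal sketch: the sample text (`abc`, `def`), the class name `EncodingMismatchDemo` and the helper `readLines` are made up for illustration, and the sample is assumed to be UTF-16LE without a byte-order mark.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class EncodingMismatchDemo {

    // Decodes the bytes with the given charset and splits them into lines via readLine().
    public static List<String> readLines(byte[] data, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample content, stored as UTF-16LE (two bytes per character, no BOM).
        byte[] utf16 = "abc\r\ndef\r\n".getBytes(StandardCharsets.UTF_16LE);

        // Wrong charset: every second byte is 0x00 and decodes to a NUL character,
        // so '\r' and '\n' are no longer adjacent and each one ends a line of its own.
        List<String> wrong = readLines(utf16, StandardCharsets.UTF_8);
        System.out.println(wrong.size());                                      // 5
        System.out.println(wrong.get(0).equalsIgnoreCase("abc"));              // false
        System.out.println(wrong.get(0).replace("\u0000", "").equals("abc"));  // true

        // Matching charset: two clean lines.
        System.out.println(readLines(utf16, StandardCharsets.UTF_16LE));       // [abc, def]
    }
}
```

The "empty" lines in the wrong case are not really empty: each holds a single NUL character, which most consoles do not display. The same invisible characters are why the comparison with "abc" fails.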

score 1 · Answer 1 · edited May 23 '17 at 11:59

Works for me.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class CRTest
{
   // Three lines terminated by a bare '\r' (old Mac style).
   static StringReader test = new StringReader( "Line 1\rLine 2\rLine 3\r" );
   public static void main(String[] args) throws IOException {
      BufferedReader buf = new BufferedReader( test );
      for( String line = null; (line = buf.readLine()) != null; )
         System.out.println( line );
   }
}

Prints:

run:
Line 1
Line 2
Line 3
BUILD SUCCESSFUL (total time: 1 second)

As Joop said, I think you've mixed up which file isn't working. Please use the above skeleton to create an MCVE and show us exactly what file input isn't working for you.
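
To illustrate the point about terminator order: `readLine()` treats `\r\n` as a single terminator, but sees `\n\r` as two terminators with an empty line in between. A small sketch with made-up sample strings; the class name `LineTerminatorDemo` and the helper `lines` are hypothetical:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LineTerminatorDemo {

    // Collects everything readLine() returns for the given text.
    public static List<String> lines(String text) throws IOException {
        List<String> result = new ArrayList<>();
        try (BufferedReader buf = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = buf.readLine()) != null) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Windows style: "\r\n" counts as one terminator.
        System.out.println(lines("Line 1\r\nLine 2\r\n")); // [Line 1, Line 2]
        // Reversed order: '\n' ends the line, then '\r' ends an empty one.
        System.out.println(lines("Line 1\n\rLine 2\n\r")); // [Line 1, , Line 2, ]
    }
}
```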

Since you appear to have a file with reversed \r\n lines, here's my first attempt at a fix. Please test it, I haven't tried it yet. You need to wrap your InputStreamReader with this class, then wrap the BufferedReader on the outside like normal.

import java.io.IOException;
import java.io.Reader;

// Drops a '\r' that directly follows a '\n', turning "\n\r" line endings into "\n".
class CRFix extends Reader
{

   private final Reader reader;
   private boolean readNL = false;   // true if the previous character was '\n'

   public CRFix( Reader reader ) {
      this.reader = reader;
   }

   @Override
   public int read( char[] cbuf, int off, int len )
           throws IOException
   {
      for( int i = off; i < off+len; i++ ) {
         int c = reader.read();
         if( c == '\r' && readNL )   // skip the '\r' of a "\n\r" pair
            c = reader.read();
         if( c == -1 )               // end of stream: report how many chars were stored
            return ( i == off ) ? -1 : i-off;
         readNL = ( c == '\n' );
         cbuf[i] = (char)c;
      }
      return len;
   }

   @Override
   public void close()
           throws IOException
   {
      reader.close();
   }

}

Alright, I guess the method I used to figure out which one uses which returns an inverse result... Not sure though. Assuming it actually IS the other way around, how do I get it to read only one line rather than two when the line separator is `\r\n`? – Maverick283, Apr 12 '16 at 19:45
I've (finally) updated my question with a sample, hope it does what it is supposed to do... — Maverick283, Apr 12 '16 at 19:59
I tested the CRFix class and I fixed how I detect end of file. It should work now. But as you mention the files you posted don't have the problem you describe. It's likely your problem is somewhere else. — markspace, Apr 12 '16 at 20:31
There we go, try File2 again, it downloads and replicates the problem, with and without CRFix using the code provided in my question — Maverick283, Apr 12 '16 at 20:42

score 0 · Accepted Answer · answered Apr 13 '16 at 12:44

Joop was right. After some more research it turns out that, even though both files specify a UTF-16 encoding in their header, only one was actually encoded in UTF-16; the other (File1) was UTF-8. This led to the "double line effect". Thanks for the effort that was put into answering this question.
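
In practice the fix is to pass the file's real charset to the reader. A minimal sketch, assuming the file starts with a byte-order mark so that Java's generic `UTF-16` charset can detect the byte order (without a BOM it falls back to big-endian, and `StandardCharsets.UTF_16LE` may be needed instead). The class name `ReadWithDeclaredCharset`, the helper `readLines` and the temporary sample file are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ReadWithDeclaredCharset {

    // Reads all lines of the file, decoding it with the given charset.
    public static List<String> readLines(Path file, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for File2: a small UTF-16 file written with a BOM.
        Path sample = Files.createTempFile("utf16-sample", ".htm");
        try {
            Files.write(sample, "abc\r\ndef\r\n".getBytes(StandardCharsets.UTF_16));
            for (String line : readLines(sample, StandardCharsets.UTF_16)) {
                // No NUL characters and no extra lines, so the comparison works.
                System.out.println(line.equalsIgnoreCase("abc") + " " + line);
            }
        } finally {
            Files.deleteIfExists(sample);
        }
    }
}
```

One design note: `Files.newBufferedReader` throws a `MalformedInputException` when the bytes are not valid in the given charset, whereas `InputStreamReader` silently substitutes replacement characters. That does not catch every mismatch (ASCII text in UTF-16LE is still valid UTF-8), but it can surface some of them earlier.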

answered Apr 13 '16 at 12:44

Maverick283

I believe you can mark your own answer as correct. You should do that if this was the actual answer. – markspace Apr 13 '16 at 23:28
I was gonna do that but you need to wait one day until you're able to do so ;) Thanks again for all your help! – Maverick283 Apr 15 '16 at 05:41
