try (Stream<String> lines = Files.lines(targetFile)) {
    List<String> replacedContent = lines
            .map(line -> StringUtils.replaceEach(line, keys, values))
            .parallel()
            .collect(Collectors.toList());
    Files.write(targetFile, replacedContent);
}

I'm trying to replace multiple text patterns in each line of the file. But I'm observing that "\r\n" (bytes 13 and 10) is being replaced with just "\r" (just 10), and my comparison tests are failing.

I want to preserve the newlines as they are in the input file and don't want Java to touch them. Could anyone suggest a way to do this without having to use a separate default replacement for "\r\n"?

Tagir Valeev
A.R.K.S
  • Sorry for missing that. Just added. – A.R.K.S Feb 10 '16 at 19:24
  • Just removed the replaceEach to isolate the issue, and it's Files.lines() which seems to be doing this. – A.R.K.S Feb 10 '16 at 19:30
  • Where is the replace happening? The code you pasted creates a List of String; it does not have any newline character. – Mrinal Feb 10 '16 at 19:30
  • `StringUtils.replaceEach()` does the replacement on each `line` where `keys` are found; `keys` are replaced with `values`. These `line`s are from a file, `targetFile`. I generate a list of strings and write them all to a file. – A.R.K.S Feb 10 '16 at 19:32
  • You are saying that “`"\r\n"` … is being replaced with just `"\r"`”. The question is where that happens, as streams don't do that. The strings produced by `Files.lines` don’t have any line breaks at all. – Holger Feb 10 '16 at 19:35
  • Got it. I have to add them explicitly. Problem is I'm writing the list to the file and they don't end with newline as streams don't include them for me. – A.R.K.S Feb 10 '16 at 19:36
  • What operating system are you running on? – Code-Apprentice Feb 10 '16 at 19:38
  • I have a hypothesis: `Files.write()` adds "end of line" characters as it writes each "line" from the given list. The precise "end of line" sequence used is dependent on the OS you are using. Since you see only `"\r"`, I guess you are on Mac OS. – Code-Apprentice Feb 10 '16 at 19:39
  • [`Files.write`](http://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#write-java.nio.file.Path-java.lang.Iterable-java.nio.file.OpenOption...-) can write a list of strings as lines and will add the system specific line break for each line. On Windows, it should be the desired `\r\n` sequence. – Holger Feb 10 '16 at 19:39
  • Which version of Windows? – Code-Apprentice Feb 10 '16 at 19:40

2 Answers


The problem is that Files.lines() is implemented on top of BufferedReader.readLine(), which reads a line up until the line terminator and throws it away. Then, when you write the lines with something like Files.write(), this supplies the system-specific line terminator after each line, which might differ from the line terminator that was read in.
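
To make that round trip concrete, here is a minimal sketch. The class name LineTerminatorDemo and the helper roundTrip are made up for illustration; only the Files calls come from the question. It writes text with mixed terminators to a temporary file, reads it back with Files.lines(), and writes the resulting list with Files.write(). Assuming the behaviour described above, every terminator should come back as System.lineSeparator(), whatever it was in the input.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class, not part of the original question or answer.
final class LineTerminatorDemo {

    /** Sends content through Files.lines() and Files.write() and returns what ends up on disk. */
    static String roundTrip(String content) {
        try {
            Path tmp = Files.createTempFile("terminator-demo", ".txt");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));

                // Files.lines() yields each line WITHOUT its terminator.
                List<String> lines;
                try (Stream<String> stream = Files.lines(tmp)) {
                    lines = stream.collect(Collectors.toList());
                }

                // Files.write(Path, Iterable) appends the platform separator after each line.
                Files.write(tmp, lines);
                return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String result = roundTrip("a\r\nb\nc\r");
        // Print with visible escapes so the terminators can be inspected.
        System.out.println(result.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```

The output depends on the platform the program runs on, which is exactly why a byte-for-byte comparison against the input can fail.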

If you really want to preserve the line terminators exactly as they are, even if they're a mixture of different line terminators, you could use a regex and Scanner for that.

First define a pattern that matches a line including the valid line terminators or EOF:

Pattern pat = Pattern.compile(".*\\R|.+\\z");

The \\R is a special linebreak matcher that matches the usual line terminators plus a few Unicode line terminators that I've never heard of. :-) You could use something like (\\r\\n|\\r|\\n) if you want just the usual CRLF, CR, or LF terminators.

You have to include .+\\z in order to match a potential last "line" in the file that doesn't have a line terminator. Make sure the regex always matches at least one character so that no match will be found when the Scanner reaches the end of the file.
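
As a quick way to see what this pattern matches, the following sketch applies it to an in-memory string with java.util.regex.Matcher instead of a Scanner. The class and method names are invented for the example. Each piece it returns is one line with its original terminator still attached, and a final unterminated line is picked up by the .+\\z branch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern itself comes from the answer.
final class LinePatternDemo {

    // A line plus its terminator, or a non-empty final line that ends at EOF.
    private static final Pattern LINE = Pattern.compile(".*\\R|.+\\z");

    /** Splits text into lines, keeping each line's terminator attached. */
    static List<String> splitKeepingTerminators(String text) {
        List<String> pieces = new ArrayList<>();
        Matcher m = LINE.matcher(text);
        while (m.find()) {
            pieces.add(m.group());
        }
        return pieces;
    }

    public static void main(String[] args) {
        for (String piece : splitKeepingTerminators("a\r\nb\nc")) {
            // Show terminators as visible escapes.
            System.out.println(piece.replace("\r", "\\r").replace("\n", "\\n"));
        }
    }
}
```

Because every match is at least one character long, concatenating the pieces gives back the original text, and an empty input produces no matches at all.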

Then, read lines using a Scanner until it returns null:

try (Scanner in = new Scanner(Paths.get(INFILE), "UTF-8")) {
    String line;
    while ((line = in.findWithinHorizon(pat, 0)) != null) {
        // Process the line, then write the output using something like
        // FileWriter.write(String) that doesn't add another line terminator.
    }
}
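
One possible way to fill in the loop body is sketched below; it is an illustration under stated assumptions, not the only way to do it. It deliberately uses a variation of the pattern: only CRLF, CR, and LF count as terminators, the line body is written as [^\r\n]* rather than .* so that no input is skipped, and capture groups separate the body from the terminator. Scanner.match() is used to read those groups. The class name, the process method, and the perLine callback (standing in for StringUtils.replaceEach) are hypothetical. It works on a String to stay self-contained; for a file you would construct the Scanner from a Path as above and append to a Writer.

```java
import java.util.Scanner;
import java.util.function.UnaryOperator;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

// Hypothetical helper class illustrating the Scanner approach.
final class PreserveTerminators {

    // Group 1: line body, group 2: its terminator; group 3: a last line with no terminator.
    private static final Pattern LINE =
            Pattern.compile("([^\\r\\n]*)(\\r\\n|\\r|\\n)|([^\\r\\n]+)\\z");

    /** Applies perLine to each line body and re-attaches the original terminator. */
    static String process(String input, UnaryOperator<String> perLine) {
        StringBuilder out = new StringBuilder();
        try (Scanner in = new Scanner(input)) {
            while (in.findWithinHorizon(LINE, 0) != null) {
                MatchResult m = in.match();
                boolean terminated = m.group(2) != null;
                String body = terminated ? m.group(1) : m.group(3);
                out.append(perLine.apply(body));
                if (terminated) {
                    out.append(m.group(2)); // exactly the terminator that was read
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String result = process("a\r\nb\nc\rd", String::toUpperCase);
        System.out.println(result.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```

Keeping the terminator out of the string passed to perLine means the replacement keys can never accidentally match part of a line break.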
Stuart Marks
  • Stuart and others, I think I can't use Scanner in a multithreaded program, right? Is there any other way to achieve this for multithreaded programs? – A.R.K.S Mar 18 '16 at 01:21
  • @AshwiniR You can use a single `Scanner` instance from only one thread at a time in a multithreaded program. Multiple threads can use different `Scanner` instances, as long as no two threads operate on the same instance. If you want to process lines from a single file in parallel, this is difficult, since reading the file and writing the output is sequential. It's probably only worth running in parallel if there is a large amount of computation for each line. – Stuart Marks Mar 18 '16 at 02:28
  • Thanks Stuart. I create a `Scanner` instance within a thread. This instance reads all the lines one by one, creates a list of lines and closes the scanner. Any other thread running in parallel to this thread will have its own instance of `Scanner`. So I don't need to worry about `Scanner` being thread-unsafe or about synchronizing the method in which I use `Scanner`, right? – A.R.K.S Mar 18 '16 at 02:44
  • Processing many lines at a time: I create an array of callables (each callable is executed by one thread to process a set of lines at a time) & then do invokeAll() on this array. Results will be collected in an array so this maintains the order and I can use it to write. This worked fine in my testing. Do you see any issues with the approach? – A.R.K.S Mar 18 '16 at 02:50
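
The chunked approach described in the last comment could look roughly like the sketch below. All names here (OrderedChunkProcessor, processInChunks, perLine) are hypothetical, and the thread pool setup is an assumption rather than the commenter's actual code. It relies on ExecutorService.invokeAll() returning its futures in the same order as the submitted tasks, so the processed lines come back in file order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;

// Hypothetical sketch of chunked, order-preserving line processing.
final class OrderedChunkProcessor {

    /** Processes lines in chunks on a thread pool and returns the results in the original order. */
    static List<String> processInChunks(List<String> lines, int chunkSize, int threads,
                                        UnaryOperator<String> perLine) {
        if (chunkSize <= 0 || threads <= 0) {
            throw new IllegalArgumentException("chunkSize and threads must be positive");
        }
        // One task per chunk; each task returns its chunk's processed lines.
        List<Callable<List<String>>> tasks = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += chunkSize) {
            List<String> chunk = lines.subList(start, Math.min(start + chunkSize, lines.size()));
            tasks.add(() -> {
                List<String> processed = new ArrayList<>(chunk.size());
                for (String line : chunk) {
                    processed.add(perLine.apply(line));
                }
                return processed;
            });
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<String> result = new ArrayList<>(lines.size());
            // invokeAll() returns futures in task order, so chunk order is preserved.
            for (Future<List<String>> future : pool.invokeAll(tasks)) {
                result.addAll(future.get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Reading and writing stay sequential on one thread here; only the per-line work is spread across the pool, which matches the point that parallelism pays off only when that work is expensive.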

The lines in your stream do not include any newline character.

It would be nice if the method documentation for Files.lines() mentioned this. However, if you follow the implementation, it eventually leads to BufferedReader.readLine(). That method is documented to return the contents of the line, not including any line-termination characters.

You can add a newline character to the lines when you write them.

The Files.write() method you're calling terminates each line with the system-dependent line separator, as documented for its sibling overload that takes a Charset. You can also get this separator with System.lineSeparator().

If you want a different line separator, and know what it is, you can specify it. For example:

    try ( PrintStream out = new PrintStream( Files.newOutputStream( targetFile ))) 
    {
        lines.forEach( line -> out.print( line + "\r\n") );
    }

If you want the original file's line separators, you can't rely only on a method that strips those out. Options include:

  • Reading the first line separator, and guessing that it's consistent throughout the file. This allows you to continue to use Files.lines() to read the lines.
  • Using an API that allows you to get lines with their separators.
  • Reading character-by-character, rather than line-by-line, so that you can get the line separators.
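
The first option above can be sketched as follows. The class name SeparatorSniffer and the detect methods are hypothetical, and the sketch assumes the file uses one separator consistently. It reads characters until it meets the first CR or LF, reports which of "\r\n", "\r", or "\n" it found, and falls back to System.lineSeparator() if the input has no line break at all.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper: guesses a file's line separator from its first line break.
final class SeparatorSniffer {

    /** Returns the first line separator found in the reader, or the platform default if none. */
    static String detect(Reader reader) throws IOException {
        int c;
        while ((c = reader.read()) != -1) {
            if (c == '\n') {
                return "\n";
            }
            if (c == '\r') {
                // A CR followed by LF is CRLF; a CR followed by anything else is a bare CR.
                return reader.read() == '\n' ? "\r\n" : "\r";
            }
        }
        return System.lineSeparator();
    }

    /** Convenience overload for in-memory text. */
    static String detect(String text) {
        try (Reader reader = new StringReader(text)) {
            return detect(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String sep = detect("first\r\nsecond\r\n");
        System.out.println(sep.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```

For a real file you might pass Files.newBufferedReader(targetFile) to detect(Reader) and then append the result after each line instead of the hard-coded "\r\n" in the PrintStream example. Note that this guess says nothing about whether the last line of the original file ended with a separator.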

WARNING: Your code reads and writes from the same file. You could lose your original data due to abnormal termination or bugs.

Andy Thomas
  • It appears that `Files.write()` adds the "end of line" sequence as it writes each line in the given list. – Code-Apprentice Feb 10 '16 at 19:41
  • I think Files.write() is adding them, but it's adding only "\r". My input file has "\r\n". I don't see a way to change this in Files.write()! – A.R.K.S Feb 10 '16 at 19:54
  • @AshwiniR - You might be able to do that by setting the `line.separator` property, but that hack affects the entire process. Using a mechanism other than `Files.write()` may be preferable. See an example in the edited text above. Also note the warning added after your comment. – Andy Thomas Feb 10 '16 at 20:06