
So, I have multiple txt files, say txt1, txt2, ..., where each line has some text between 4 and 22 characters, and I have another txt file with similar values, say bigTxt. The goal is to find all values in bigTxt that occur somewhere in any of the txt files and output those values (we're guaranteed that if a line of bigTxt is in any of the txt files, it matches only once). The best solution I have so far works, but is slightly inefficient. Basically, it looks like this:

txtFiles.parallelStream().forEach(file -> {
   List<String> txtList = /* list of lines of this txt file */;
   streamOfLinesOfBigTxt.forEach(line -> {
         if (txtList.contains(line)) {
            System.out.println(line);
            // it'd be great if we could just stop this forEach loop here,
            // but that seems hardish
         }
   });
});

(Note: I tried breaking out of the forEach using Honza's "bad idea" solution here: Break or return from Java 8 stream forEach?, but that must be doing something other than what I want, because it actually made the code a bit slower or about the same.)

The small problem with this is that even after one file has found a match between a line of bigTxt and one of the txt files, the other txt files still go on checking that line (even though we've already found a match and that's sufficient). Something I tried in order to stop this was iterating over the bigTxt lines first (not in parallel, while going through each txt file in parallel) and using Java's anyMatch, and I was getting a "stream has already been operated upon or closed" error, which I understood later was because anyMatch is a terminal operation. So, after just one call to anyMatch on one of the lines of one of the txt files, that stream was no longer available for my processing later. I couldn't think of a way to properly use findAny, and I don't think allMatch is what I want either, since not every value from bigTxt will necessarily be in one of the txt files. Any (parallel) solutions to this (even ones not strictly limited to Java 8) are welcome. Thank you.
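For illustration, the failing anyMatch attempt looked roughly like this (a sketch, not my exact code; the file name is a placeholder):

// each txt file was opened as a Stream once, but anyMatch is a terminal
// operation, so checking a second line of bigTxt against the same stream fails
Stream<String> linesOfTxt1 = Files.lines(Paths.get("txt1"));
boolean firstCheck  = linesOfTxt1.anyMatch("someLineOfBigTxt"::equals);    // fine
boolean secondCheck = linesOfTxt1.anyMatch("anotherLineOfBigTxt"::equals);
// throws java.lang.IllegalStateException:
// stream has already been operated upon or closed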

jimmyq
    Welcome to Stack Overflow! It seems that your code currently works, and you are looking to improve it. Generally these questions are too opinionated for this site, but you might find better luck at [CodeReview.SE](//codereview.stackexchange.com/tour). Remember to read [their requirements](//codereview.stackexchange.com/help/on-topic) as they are a bit more strict than this site. – 4castle Jul 14 '17 at 00:46
    This code is not “slightly inefficient”, it’s *horribly inefficient*, calling `contains` on a large `List` repeatedly. – Holger Jul 14 '17 at 06:37

1 Answer


If streamOfLinesOfBigTxt is a Stream, you will get the same error with the code posted in your question, as you are trying to process that stream multiple times with your outer stream’s forEach. It’s not clear why you didn’t notice that, but perhaps you always stopped the program before it ever started processing the second file? After all, the time needed for searching the List of lines linearly for every line of the big file scales with the product of both numbers of lines.
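For comparison, one way that nested loop runs without the exception is to materialize the big file's lines first, e.g. with Files.readAllLines (a sketch, assuming txtFiles holds path strings); it still pays that quadratic cost:

List<String> bigLines = Files.readAllLines(Paths.get(bigFileLocation));
txtFiles.parallelStream().forEach(file -> {
    List<String> txtList;
    try { txtList = Files.readAllLines(Paths.get(file)); }
    catch (IOException ex) { throw new UncheckedIOException(ex); }
    // contains() is a linear scan, so the total work is still n × m comparisons
    bigLines.forEach(line -> {
        if (txtList.contains(line)) System.out.println(line);
    });
});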

When you say you want “to check all values that are in bigTxt that occur somewhere in any of the txt files and output those values”, you could do exactly that straightforwardly:

Files.lines(Paths.get(bigFileLocation))
     .filter(line -> txtFiles.stream()
                 .flatMap(path -> {
                         try { return Files.lines(Paths.get(path)); }
                         catch (IOException ex) { throw new UncheckedIOException(ex); }
                     })
                 .anyMatch(Predicate.isEqual(line)) )
     .forEach(System.out::println);

This does short-circuit, but still has the problem of a processing time that scales with n×m. Even worse, it will re-open and read the txtFiles repeatedly.
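A side note on resource handling: flatMap closes each inner stream after its contents have been consumed, but the outer Files.lines stream stays open until it is closed, so a sketch of the same pipeline wrapped in try-with-resources would be:

try (Stream<String> bigLines = Files.lines(Paths.get(bigFileLocation))) {
    bigLines.filter(line -> txtFiles.stream()
                .flatMap(path -> {
                    try { return Files.lines(Paths.get(path)); }
                    catch (IOException ex) { throw new UncheckedIOException(ex); }
                })
                .anyMatch(Predicate.isEqual(line)))
            .forEach(System.out::println);
} catch (IOException ex) {
    throw new UncheckedIOException(ex);
}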

If you want to avoid the repeated reading and the n×m scaling, storing the data in RAM is unavoidable. And if you store it anyway, you can choose a data structure that supports better-than-linear lookup in the first place:

Set<String> matchLines = txtFiles.stream()
    .flatMap(path -> {
        try { return Files.lines(Paths.get(path)); }
        catch (IOException ex) { throw new UncheckedIOException(ex); }
    })
    .collect(Collectors.toSet());

Files.lines(Paths.get(bigFileLocation))
     .filter(matchLines::contains)
     .forEach(System.out::println);

Now, the execution time of this scales with the sum of the number of lines of all files rather than the product. But it needs temporary storage for all distinct lines of the txtFiles.
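For reference, here is the same approach as a self-contained sketch with the imports the snippets above rely on (the file names are placeholders):

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MatchLines {
    public static void main(String[] args) throws IOException {
        List<String> txtFiles = Arrays.asList("txt1", "txt2"); // placeholder paths
        String bigFileLocation = "bigTxt";                     // placeholder path

        // all distinct lines of the txt files (Collectors.toSet() typically yields a HashSet)
        Set<String> matchLines = txtFiles.stream()
            .flatMap(path -> {
                try { return Files.lines(Paths.get(path)); }
                catch (IOException ex) { throw new UncheckedIOException(ex); }
            })
            .collect(Collectors.toSet());

        // stream the big file once and print every line contained in the set
        try (Stream<String> bigLines = Files.lines(Paths.get(bigFileLocation))) {
            bigLines.filter(matchLines::contains)
                    .forEach(System.out::println);
        }
    }
}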

If the big file has fewer distinct lines than the other files together and the order doesn’t matter, you can store the lines of the big file in a set instead and check the lines of the txtFiles on the fly:

Set<String> matchLines
    = Files.lines(Paths.get(bigFileLocation)).collect(Collectors.toSet());

txtFiles.stream()
        .flatMap(path -> {
            try { return Files.lines(Paths.get(path)); }
            catch (IOException ex) { throw new UncheckedIOException(ex); }
        })
        .filter(matchLines::contains)
        .forEach(System.out::println);

This relies on the property that all matching lines are unique across all these text files, as you have stated in your question.
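If that guarantee ever did not hold, the last variant would print a line once per txt file it occurs in; a small sketch of a fix is to add a distinct() step after the filter:

txtFiles.stream()
        .flatMap(path -> {
            try { return Files.lines(Paths.get(path)); }
            catch (IOException ex) { throw new UncheckedIOException(ex); }
        })
        .filter(matchLines::contains)
        .distinct() // print each matching line only once, even if it occurs in several files
        .forEach(System.out::println);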

I don’t think that there will be any benefit from parallel processing here, as the I/O speed will dominate the execution.
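If you want to try it anyway, the only change to the last variant is processing the txt files in parallel (a sketch; the output order then becomes nondeterministic):

txtFiles.parallelStream()
        .flatMap(path -> {
            try { return Files.lines(Paths.get(path)); }
            catch (IOException ex) { throw new UncheckedIOException(ex); }
        })
        .filter(matchLines::contains)
        .forEach(System.out::println); // unordered; matches may print in any order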

Holger