
I was trying to count the number of unique words in a text file. For the sake of simplicity, my current file content is:

This is a sample file

My attempt is:

long wordCount = 
    Files.lines(Paths.get("sample.txt"))
         .map(line -> line.split("\\s+"))
         .distinct()
         .count();
System.out.println(wordCount);

This compiles and runs fine, but results in 1, while it should be 5.

Eran
    Possible duplicate of [How to count words in a text file, java 8-style](https://stackoverflow.com/questions/47594679/how-to-count-words-in-a-text-file-java-8-style) – Julien Lopez Jan 09 '19 at 07:21

2 Answers


You are mapping each line to an array (transforming a Stream<String> into a Stream<String[]>) and then counting the distinct arrays. Since arrays don't override equals(), every array is distinct, so the result is the number of lines in the file.

You should use flatMap to create a Stream<String> of all the words in the file, and after the distinct() and count() operations, you'll get the number of distinct words.

long wordCount = 
    Files.lines(Paths.get("sample.txt"))
         .flatMap(line -> Arrays.stream(line.split("\\s+")))
         .distinct()
         .count();
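To make the difference concrete, here is a minimal, self-contained sketch that runs both pipelines on in-memory lines instead of a file. The class name DistinctWords and the two helper methods are made up for illustration; with a real file you would feed Files.lines(...) into the same pipeline.

```java
import java.util.Arrays;
import java.util.List;

class DistinctWords {

    // Mirrors the question's pipeline: each line becomes a single String[] element.
    // Arrays use identity-based equals(), so distinct() never merges two of them.
    static long countWithMap(List<String> lines) {
        return lines.stream()
                .map(line -> line.split("\\s+"))
                .distinct()
                .count();
    }

    // flatMap flattens each String[] into its words, so distinct() compares Strings.
    static long countWithFlatMap(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .distinct()
                .count();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("This is a sample file");
        System.out.println(countWithMap(lines));     // prints 1 (one line, one array)
        System.out.println(countWithFlatMap(lines)); // prints 5 (five distinct words)
    }
}
```

Note that an empty line would still contribute an empty string as a "word" in the flatMap version, since "".split("\\s+") yields one empty element.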
Eran
    It might be more efficient not to scan for line breaks when you only want to count words, i.e. in Java 9: `new Scanner(Paths.get("sample.txt")).findAll("\\S+").map(MatchResult::group).distinct().count()`. Another advantage of this approach is that it won’t treat empty lines as words. In either case, whether you use `Files.lines` or `Scanner.find`, the resource should be closed after use in production code. – Holger Jan 09 '19 at 17:15
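A rough sketch of what the comment above suggests, assuming Java 9 or later for Scanner.findAll. The class name ScannerWordCount, the helper countDistinctWords, and the temp file used in main are illustrative assumptions, not part of the original answer. The try-with-resources block closes the Scanner (and the underlying file) once counting is done.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Scanner;
import java.util.regex.MatchResult;

class ScannerWordCount {

    // Counts distinct whitespace-delimited tokens without splitting into lines first.
    static long countDistinctWords(Path file) throws IOException {
        try (Scanner scanner = new Scanner(file, "UTF-8")) {
            return scanner.findAll("\\S+")        // Stream<MatchResult>, Java 9+
                    .map(MatchResult::group)
                    .distinct()
                    .count();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data written to a temp file, including an empty line.
        Path file = Files.createTempFile("sample", ".txt");
        try {
            Files.write(file, Arrays.asList("This is a sample file", "", "This is"));
            System.out.println(countDistinctWords(file)); // prints 5
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The empty line contributes nothing here because "\\S+" only matches runs of non-whitespace characters. The same try-with-resources pattern applies to the stream returned by Files.lines.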

You seem to be counting the lines in your file instead:

map(line -> line.split("\\s+")) // this is a Stream<String[]>

You should additionally use Stream.flatMap:

long wordCount = Files.lines(Paths.get("sample.txt"))
        .map(line -> line.split("\\s+"))
        .flatMap(Arrays::stream)
        .distinct()
        .count();
Naman