
I am iterating through several text files and I am trying to find the top 20 words amongst all the text files. I have managed to set up some code to find the top 20 words in a single file. However, now I am struggling with several files.

I have a global LinkedHashMap in which I want to store every new word I come across in a text file as a key, and update its value (the number of times the word occurs) as I come across more occurrences. For example, if I find 8000 instances of the word "the" in the first file and 7000 instances of "the" in the next file, then I want the value of the key "the" to be updated to 15000.

Here is my code:

import java.util.*;
import java.util.stream.Collectors;
import java.io.IOException;
import java.nio.file.*;
import java.util.Map.Entry;
import java.util.function.Function;
import java.io.File;
import java.nio.charset.StandardCharsets;

public class FileReaderTwo
{
    static LinkedHashMap<String, Long> top20Words = null;
    public static void main(String args[])
    {
        File dir = new File("data/");
        for (File file : dir.listFiles()) 
        {
            try
            {
                top20Words = Files.lines(Paths.get(file.toString()), StandardCharsets.ISO_8859_1)
                        .flatMap(line -> Arrays.stream(line.toLowerCase().split("[\\(,\\).\\s+]+")))
                        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).entrySet().stream()
                        .sorted(Entry.comparingByValue(Comparator.reverseOrder()))
                        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                        .collect(Collectors.toMap(Entry::getKey, Entry::getValue, (u, v) -> u, LinkedHashMap::new));
            } catch (IOException e)
            {
                e.printStackTrace();
            }
        }
        System.out.println(top20Words);
    }
}

Note: I know that at the moment it prints out every word rather than the top 20. I wanted to deal with the multiple-file issue first and fix that later.

  • Are you *sure* you wanted that `+` *inside* the character class in the regex? `[\(,\).\s+]+`? --- Also, parentheses are not special inside a character class, so they don't need escaping. I think you simply meant `[(),.\s]+`. --- Perhaps you want all non-letter characters other than `'` and `-`? If so, specify the characters you want to keep, then negate that, e.g. `[^\p{L}\p{N}'\-]+` – Andreas Sep 24 '20 at 01:47
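
To make the point of the comment above concrete, here is a minimal sketch comparing the question's character class with the negated class the comment suggests. The class name `SplitRegexDemo`, the two helper methods, and the sample sentence are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SplitRegexDemo {
    // The question's pattern: inside the character class, '+' is a literal
    // delimiter, and characters such as ';' and '!' are not delimiters at all.
    public static List<String> splitOriginal(String line) {
        return Arrays.asList(line.toLowerCase().split("[\\(,\\).\\s+]+"));
    }

    // The suggested pattern: everything that is not a letter, a digit,
    // an apostrophe, or a hyphen separates words.
    public static List<String> splitKeepWordChars(String line) {
        return Arrays.stream(line.toLowerCase().split("[^\\p{L}\\p{N}'\\-]+"))
                .filter(word -> !word.isEmpty()) // a leading delimiter yields ""
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String line = "C++ isn't Java; (well-known) fact!";
        System.out.println(splitOriginal(line));      // [c, isn't, java;, well-known, fact!]
        System.out.println(splitKeepWordChars(line)); // [c, isn't, java, well-known, fact]
    }
}
```

With the original pattern, "java;" and "fact!" keep their punctuation and would be counted separately from "java" and "fact"; the negated class avoids that.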

2 Answers

Okay, I modified this to do what I believe you are looking for. It works as follows.

  • I created two helper methods to handle the exceptions thrown by file access. I find this easier and cleaner than trying to place try/catch constructs inside a stream pipeline.
  • It takes all the words from all the files, does a frequency count, and stores the counts in a map.
  • It sorts the entrySet of that map and places the top 20 words (highest word count) in a new map in descending order.

The overall result is to count all the words in multiple files and present the tally in descending order.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FileWordCount {
    
    public static void main(String[] args) {
        FileWordCount fwc = new FileWordCount();
        Map<String,Long> map = fwc.getTheWords();
        map.entrySet().forEach(System.out::println);
    }
    
    // Helper methods that handle I/O exceptions outside the stream pipeline.
    // Each returns an empty stream on failure so the pipeline never sees null.
    private Stream<Path> getFiles(String dir) {
        try {
            return Files.list(Path.of(dir));
        } catch (Exception e) {
            e.printStackTrace();
        }
        return Stream.empty();
    }
    
    private Stream<String> getLines(Path path) {
        try {
            return Files.lines(path, StandardCharsets.ISO_8859_1);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return Stream.empty();
    }
        
    
    public Map<String, Long>  getTheWords() {
        
        String dir = "f:./data";
    
        return getFiles(dir)
              .flatMap(this::getLines) 
                .flatMap(line -> Arrays.stream(
                        line.toLowerCase().split("[\\(,\\).\\s+]+")))
                .collect(Collectors.groupingBy(word -> word,
                        Collectors.counting())) 
                .entrySet().stream() 
                .sorted(Entry.<String,Long>comparingByValue().reversed().
                        thenComparing(Entry.<String,Long>comparingByKey()))
                .limit(20) // limits the number of entries
                .collect(Collectors.toMap(Entry::getKey, Entry::getValue,
                        (r,u)->r,
                        LinkedHashMap::new));

    }
}

Note: I sorted first on count in descending order and then, if there was a tie, alphabetically in ascending order.

WJS
  • Thank you for the response. However, I do not know how I would sort this into one singular LinkedHashMap (I am a novice at this kind of stuff). This seems to be in the format >. I am trying to have just one LinkedHashMap with the format <String, Long>, where String is the word and Long is the number of times it occurs across all files together. – Mikolas Slama Sep 24 '20 at 02:53
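
For readers who prefer the explicit "one shared map, updated file by file" approach described in the question, the following is a minimal sketch using `Map.merge`. The class `MergeCountsDemo` and the method `addCounts` are hypothetical names, and each "file" is passed in as a list of lines so the example runs without any files on disk. It uses the simplified delimiter class `[(),.\s]+` rather than the question's original pattern.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeCountsDemo {
    // Adds the words of one "file" (given here as its lines) to the shared totals.
    // merge() inserts 1 for a new word and adds 1 to the existing count otherwise.
    public static void addCounts(Map<String, Long> totals, List<String> lines) {
        for (String line : lines) {
            for (String word : line.toLowerCase().split("[(),.\\s]+")) {
                if (!word.isEmpty()) {
                    totals.merge(word, 1L, Long::sum);
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Long> totals = new LinkedHashMap<>();
        addCounts(totals, List.of("The cat and the dog.")); // first "file"
        addCounts(totals, List.of("The end."));             // second "file"
        System.out.println(totals); // {the=3, cat=1, and=1, dog=1, end=1}
    }
}
```

The map keeps insertion order, not frequency order, so a final sort is still needed to extract the top 20. Note also that `LinkedHashMap` is not safe for concurrent updates; if several threads were to update the totals, a `ConcurrentHashMap`, which offers the same `merge` method, would be one option.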

First, don't mix the old File API with the new NIO.2 API.

You combine the results from all the files by starting with a Stream of files.

Path dir = Paths.get("data/");
LinkedHashMap<String, Long> top20Words = Files.list(dir)
    .filter(path -> ! Files.isDirectory(path))
    .flatMap(file -> {
        try {
            return Files.lines(file, StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            e.printStackTrace();
            return Stream.empty();
        }
    })
    // the rest is copied from question, to show context
    .flatMap(line -> Arrays.stream(line.toLowerCase().split("[\\(,\\).\\s+]+")))
    .collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).entrySet().stream()
    .sorted(Entry.comparingByValue(Comparator.reverseOrder()))
    .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
    .collect(Collectors.toMap(Entry::getKey, Entry::getValue, (u, v) -> u, LinkedHashMap::new));
System.out.println(top20Words);
Andreas
  • Does this turn all the files into a singular stream? I am not sure if that is what I am looking for because I would eventually like to implement threads where each thread would tackle one file and update a global LinkedHashMap. – Mikolas Slama Sep 24 '20 at 02:55
  • @MikolasSlama Then add `.parallel()` e.g. before the `filter()` call, and each file will be processed by a separate thread. When the last file has been processed in full, the `collect()` call will return the one-and-only "global" LinkedHashMap. – Andreas Sep 24 '20 at 03:00
  • I will have to look into parallel(). I am a novice so I apologize if I am asking too many questions to a given answer. When I try to run the code you have provided it does not compile. It says there is an unreported exception for Files.list(dir) even with the try-catch you have provided. – Mikolas Slama Sep 24 '20 at 03:16
  • @MikolasSlama `Files.list()` is outside that `try-catch`. You need another, or add a `throws` to your method. – Andreas Sep 24 '20 at 08:40
  • There is no need to sort twice. And I think, it’s cheaper to convert the smaller strings, i.e. remove the `toLowerCase()` before the `split` and replace `Function.identity()` by `String::toLowerCase` in the `groupingBy` collector. – Holger Sep 24 '20 at 16:42
  • @Holger Great comment ... for the question. This answer is about combining the `Files.list()` and `Files.lines` streams into a single stream. The rest of the stream chain is copied straight from the question. I suggest you leave the comment there. – Andreas Sep 24 '20 at 18:56
  • @Andreas I know that this is not the focus of the answer. But when I leave a comment at the question and the OP removes the redundant `sorted`, your answer will suddenly look as if you added a `sorted` step, confusing the readers. So I prefer leaving the comment at a place where both, the questioner and the answerer who copied the code, get a notification. Generally, I think, it’s worth pointing out obvious issues even when not being the question’s focus. Would you tell someone who asks “does the road lead to …” a simple “yes, it does”, because “they didn’t ask whether the bridge is safe…”? – Holger Sep 25 '20 at 06:34
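
Pulling the points from the comments above together, here is one possible sketch rather than a definitive version: `Files.list()` sits in a try-with-resources and the method declares `throws IOException`, there is a single `sorted` step, lower-casing happens in the `groupingBy` classifier, and `limit` keeps only the top entries. The class name `TopWords`, the method `topWords`, its parameters, the tie-break on the key, and the simplified delimiter class are assumptions made for this illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map.Entry;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TopWords {
    // Counts words across all regular files in dir and returns the n most frequent.
    public static LinkedHashMap<String, Long> topWords(Path dir, int n) throws IOException {
        try (Stream<Path> files = Files.list(dir)) { // closes the directory handle
            return files
                .filter(Files::isRegularFile)
                .flatMap(TopWords::lines)
                .flatMap(line -> Arrays.stream(line.split("[(),.\\s]+")))
                .filter(word -> !word.isEmpty())
                // lower-case each word here instead of each whole line earlier
                .collect(Collectors.groupingBy(String::toLowerCase, Collectors.counting()))
                .entrySet().stream()
                // one sort: highest count first, ties broken alphabetically
                .sorted(Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Entry.<String, Long>comparingByKey()))
                .limit(n)
                .collect(Collectors.toMap(Entry::getKey, Entry::getValue,
                        (u, v) -> u, LinkedHashMap::new));
        }
    }

    // Files.lines() throws a checked exception, so it is wrapped for use in flatMap.
    private static Stream<String> lines(Path file) {
        try {
            return Files.lines(file, StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            e.printStackTrace();
            return Stream.empty(); // skip unreadable files
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(topWords(Paths.get("data/"), 20));
    }
}
```

As suggested in the comments, adding `.parallel()` before the `filter()` call would let the stream framework spread the work over several threads while still returning a single map from `collect()`.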