0

How would you evaluate the solution for the following task in regard of structure, correctness, simplicity, testability(task time approx. 1 hr):

Create a command-line Java program that counts unique words from a text file and lists the top 10 occurrences.

English locale and treating hyphen and apostrophe as part of a word, output should look like the following:

and (514)

the (513)

i (446)

to (324)

a (310)

of (295)

my (288)

you (211)

that (188)

this (185)

Solution:

WordCalculator.java(main class)

public class WordCalculator {

    /**
     * Counts unique words from a text file and lists the top 10 occurrences.
     *
     * @param args the command line arguments. First argument is the file path.
     * If omitted, user will be prompted to specify path.
     *
     * @throws java.io.FileNotFoundException if the file for some other reason
     * cannot be opened for reading.
     *
     * @throws java.io.IOException If an I/O error occurs
     */
    public static void main(String[] args) throws FileNotFoundException, IOException {

        File file;
        List<String> listOfWords = new ArrayList<>();

        // If a command argument is specified, use it as the file path.
        // Otherwise prompt user for the path.
        if (args.length > 0) {

            file = new File(args[0]);

        } else {

            Scanner scanner = new Scanner(System.in);
            System.out.print("Enter path to file: ");
            file = new File(scanner.nextLine());

        }

        // Reads the file and splits the input into a list of words
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {

            String line;
            while ((line = br.readLine()) != null) {
            listOfWords.addAll(WordUtil.getWordsFromString(line));
            }

        } catch (FileNotFoundException ex) {

            Logger.getLogger(WordCalculator.class.getName()).log(Level.SEVERE,
                String.format("Access denied reading from file '%s'.", file.getAbsolutePath()), ex);
            throw ex;

        } catch (IOException ex) {

            Logger.getLogger(WordCalculator.class.getName()).log(Level.SEVERE,
                "I/O error while reading input file.", ex);
            throw ex;

        }

        // Retrieves the top ten frequent words and their frequencies.
        Map<Object, Long> freqMap = FrequencyUtil.getItemFrequencies(listOfWords);
        List<Map.Entry<?, Long>> topTenWords = FrequencyUtil.limitFrequency(freqMap, 10);

        // Prints the top ten words and their frequencies.
        topTenWords.forEach((word) -> {
        System.out.printf("%s (%d)\r\n", word.getKey(), word.getValue());
        });
    }
}

FrequencyUtil.java

public class FrequencyUtil {

    /**
     * Transforms a list into a map with elements and their frequencies.
     *
     * @param list, the list to parse
     * @return the item-frequency map.
     */
    public static Map<Object, Long> getItemFrequencies(List<?> list) {

        return list.stream()
                .collect(Collectors.groupingBy(obj -> obj,Collectors.counting()));

    }

    /**
     * Sorts a frequency map in descending order and limits the list.
     *
     * @param objFreq the map elements and their frequencies.
     * @param limit the limit of the returning list
     * @return a list with the top frequent words
     */
    public static List<Map.Entry<?, Long>> limitFrequency(Map<?, Long> objFreq, int limit) {

        return objFreq.entrySet().stream()
            .sorted(Map.Entry.comparingByValue(Comparator.reverseOrder()))
            .limit(limit)
            .collect(Collectors.toList());

    }

}

WordUtil.java

public class WordUtil {

    public static final Pattern ENGLISH_WORD_PATTERN = Pattern.compile("[A-Za-z'\\-]+");

    /**
     *
     * @param s the string to parse into a list of words. Words not matching the
     * english pattern(a-z A-z ' -) will be omitted.
     *
     * @return a list of the words
     *
     */
    public static List<String> getWordsFromString(String s) {

        ArrayList<String> list = new ArrayList<>();
        Matcher matcher = ENGLISH_WORD_PATTERN.matcher(s);

        while (matcher.find()) {

            list.add(matcher.group().toLowerCase());

        }

        return list;

    }

}
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Unknown
  • 136
  • 1
  • 5

1 Answers1

2

Your solution is correct but if you are looking for a less functional programming solution and more OOP. You should avoid use Utils classes with static methods. Instead of you can use your WordCalculator adding instance methods and properties as the map for count words. Also regex patterns are heavy to the performance op and you are doing loops (in a functional way) to add this splitted words to the map. Other option is read your file byte per byte and when you found a non alphabetical character (it the text file is simple is enough check for a whitespace) dump the word from a StringBuilder to the map and add 1 to the counter. With this you also avoid possible problems if the file is a huge one line text.

Update 1 - read words example added:

private void readWords(File file) {

    try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file))) {
        StringBuilder build = new StringBuilder();

        int value;
        while ((value = bufferedReader.read()) != -1) {
            if(Character.isLetterOrDigit(value)){
                build.append((char)Character.toLowerCase(value));
            } else {
                if(build.length()>0) {
                    addtoWordMap(build.toString());
                    build = new StringBuilder();
                }
            }
        }
        if(build.length()>0) {
            addtoWordMap(build.toString());
        }

    } catch(FileNotFoundException e) {
        //todo manage exception
        e.printStackTrace();
    } catch (IOException e) {
        //todo manage exception
        e.printStackTrace();
    }
}
Community
  • 1
  • 1
JunglarX
  • 21
  • 3
  • I see that util classes are bad practice in a OOP perspective... The reason I did not append the main class with any instances is because I want the main class to be focused in doing only one thing, "wiring it all up". I agree that Stringbuffer would be a faster and a better approach. I used Pattern in hope to find a built-in regex for english words but did not find anything ( I wasn't concerned with performance, I looked for simplicity). – Unknown Aug 15 '17 at 17:05
  • Your option regarding reading byte by byte from a file is not efficient( I guess each call to read() triggers a OS input request), one should use buffers. However, with buffers I would need to handle the situation when parsing a word that ends up in between two reads and I was afraid that might be more complex than just using read line(didn't think of buffered readers...).. In terms of memory effenciy, JVM uses atleast 16 bytes for each object instance(included Boolean) so I would not recommend having huge lists of words in java at all... – Unknown Aug 15 '17 at 17:08
  • 1
    If you read byte by byte from file without use buffers of course is very inefficient but using intermediate buffers it changes completely. When I said read byte by byte I refereed using always buffers and read from the buffers so calls to I/O are performed just when you reach the end of the buffer and it's necessary bring new data from disk (automatically managed by JVM). Using buffers you also can custom the buffers sizes. I updated the answer with a example of read and parse the words. However your solution is correct at all. Just depends from the observer. – JunglarX Aug 16 '17 at 09:14