I am trying to write a program that counts all the words in a text file. I put every word that matches the patterns into a TreeMap.

I get the path of the text file through args[0].

For example, the text file contains this text: The Project Gutenberg EBook of The Complete Works of William Shakespeare

The condition that checks whether the TreeMap already contains the word returns false for the second appearance of the word The, but returns true for the second appearance of the word of.

I don't understand why.
This is my code:

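import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordCount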
{
    public static void main(String[] args)
    {
        // Charset charset = Charset.forName("UTF-8");
        // Locale locale = new Locale("en", "US");

        Path p0 = Paths.get(args[0]);
        Path p1 = Paths.get(args[1]);
        Path p2 = Paths.get(args[2]);

        Pattern pattern1 = Pattern.compile("[a-zA-Z]");
        Matcher matcher;
        Pattern pattern2 = Pattern.compile("'.");

        Map<String, Integer> alphabetical = new TreeMap<String, Integer>();

        try (BufferedReader reader = Files.newBufferedReader(p0))
        {
            String line = null;

            while ((line = reader.readLine()) != null)
            {
                // System.out.println(line);
                for (String word : line.split("\\s"))
                {
                    boolean found = false;

                    matcher = pattern1.matcher(word);
                    while (matcher.find())
                    {
                        found = true;
                    }
                    if (found)
                    {
                        boolean check = alphabetical.containsKey(word.toLowerCase());
                        if (!alphabetical.containsKey(word.toLowerCase()))
                            alphabetical.put(word.toLowerCase(), 1);
                        else
                            alphabetical.put(word.toLowerCase(), alphabetical.get(word.toLowerCase()).intValue() + 1);
                    }
                    else
                    {
                        matcher = pattern2.matcher(word);
                        while (matcher.find())
                        {
                            found = true;
                        }
                        if (found)
                        {
                            // Drop the leading character and lowercase once, so that
                            // containsKey, get and put all use the same key.
                            String key = word.substring(1).toLowerCase();
                            if (!alphabetical.containsKey(key))
                                alphabetical.put(key, 1);
                            else
                                alphabetical.put(key, alphabetical.get(key) + 1);
                        }
                    }
                }
            }
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
Asaf
  • What is the purpose of `boolean check`? It's not used anywhere! BTW, I've just tried your code (reading from a file, but not using `args`) and it works fine. – Yahya Jun 22 '17 at 13:54
  • I know that it's not used anywhere, I only use this variable in debug mode. The `while (matcher.find()) { found = true; }` is there to check whether the word matches the pattern. Any help? – Asaf Jun 22 '17 at 13:58
  • You don't need the `while` loop. I tried your code and it works as expected! It gives the expected output, and the `boolean check` works as expected as well! – Yahya Jun 22 '17 at 14:03
  • If I use `String str = "The Project Gutenberg EBook of The Complete Works of William Shakespeare";` then `if (!alphabetical.containsKey(word.toLowerCase()))` returns `true` for the second appearance of the word **The**, but if I read the words from the file with `String word : line.split("\\s")` I get `false`. Why? – Asaf Jun 22 '17 at 14:09
  • How do you know it returns `true`? The `check` code doesn't contain the negation operator! Try writing `System.out.println(word + ": " + check);` directly under the `check` line and tell me what the result is. – Yahya Jun 22 '17 at 14:13
  • `?The: false Project: false Gutenberg: false EBook: false of: false The: false` This is from the text file. I don't know why I get the `?` – Asaf Jun 22 '17 at 14:18

1 Answer

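I've tested your code and it works. I think you have to check your file's encoding.

It is most likely saved as UTF-8 with a byte order mark (BOM). The BOM is read as an invisible first character, so the first word is stored with that character in front of "the" and never matches the later "the". Save the file as "UTF-8 without BOM" and you'll be OK!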
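
To make the effect concrete, here is a minimal sketch. The class name `BomKeyDemo`, the helper `count` and the hard-coded input string are illustrative assumptions, not part of the original program, and `Map.merge` is used only as a shorthand for the `containsKey`/`put` logic in the question. The string starts with U+FEFF, which stands in for the BOM that the reader decodes as the first character of the file:

```java
import java.util.Map;
import java.util.TreeMap;

public class BomKeyDemo
{
    // Counts words roughly like the question's code: split on whitespace, lowercase the key.
    static Map<String, Integer> count(String line)
    {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : line.split("\\s"))
        {
            counts.merge(word.toLowerCase(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args)
    {
        // The leading U+FEFF is not whitespace, so it stays glued to the first word.
        Map<String, Integer> counts = count("\uFEFFThe Project of The Works of William");

        // Two different keys exist for what looks like the same word.
        System.out.println(counts.containsKey("the"));        // added by the second "The"
        System.out.println(counts.containsKey("\uFEFFthe"));  // added by the first "The"
        System.out.println(counts.get("the"));
        System.out.println(counts.get("of"));
    }
}
```

Under these assumptions the word "of" is counted twice, while "The" ends up split across two keys with a count of one each, which matches the behaviour described in the question.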

Edit: If you can't change the encoding, you can skip the BOM manually. See this link: http://www.rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html
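
Skipping the BOM manually can be sketched roughly as follows. The names `BomSkipper` and `skipBom` are hypothetical, and a `StringReader` stands in for the reader returned by `Files.newBufferedReader`; the idea is simply to mark the start of the stream, read one character, and rewind if it is not U+FEFF:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class BomSkipper
{
    // Consumes a leading U+FEFF if present; otherwise rewinds so no character is lost.
    static BufferedReader skipBom(BufferedReader reader) throws IOException
    {
        reader.mark(1);                 // remember the start of the stream
        if (reader.read() != '\uFEFF')  // first character is not a BOM (or the stream is empty)
        {
            reader.reset();             // go back to the marked position
        }
        return reader;
    }

    public static void main(String[] args) throws IOException
    {
        // In the real program this would wrap Files.newBufferedReader(path).
        BufferedReader reader = skipBom(new BufferedReader(new StringReader("\uFEFFThe Project")));
        String firstWord = reader.readLine().split("\\s")[0];
        System.out.println(firstWord.equals("The"));
    }
}
```

Calling such a helper once, right after opening the file and before the read loop, leaves files without a BOM untouched, so the same program should handle both variants.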

Regards

zThulj
  • My file is `UTF-8`. How is that related? – Asaf Jun 22 '17 at 14:32
  • You get a `?` at the start of the file when reading it. I think it is the byte order mark. Change the encoding to UTF-8 without BOM and it should be OK. – zThulj Jun 22 '17 at 14:33
  • If I open the text file with `Notepad` I only see that the file is `UTF-8`, but when I open it with `Notepad++` I see that it is `UTF-8-BOM`. When I change it to `UTF-8` it works. But I can't change my input text file, because the person who will check my program doesn't want to make this change. How can I proceed? – Asaf Jun 22 '17 at 14:44
  • If I use `newBufferedReader(Path, Charset)` with `Charset charset = Charset.forName("UTF-8-BOM");` I get: `Exception in thread "main" java.nio.charset.UnsupportedCharsetException: UTF-8-BOM at java.nio.charset.Charset.forName(Unknown Source)` What do I need to do? – Asaf Jun 22 '17 at 15:27
  • You can see here how to handle the BOM in Java (you have to do it manually): http://www.rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html – zThulj Jun 22 '17 at 15:33
  • Thanks a lot!! I added this code before the `while`: `reader.mark(4); if ('\ufeff' != reader.read()) reader.reset();` (the BOM marker only appears at the very beginning; if the first character is not the BOM, the reader is reset) and now it works well. – Asaf Jun 22 '17 at 16:10