
This should be straightforward, but when I count the occurrences of a word in a file after downloading it to my SD card, the number is off. The more occurrences there are, the further off my result is. I verify the number of occurrences in Microsoft Word (using ignore case and whole word only), and in my code I track it with the "the_counter" variable below. I have also verified that nothing is wrong with the download and that the FULL file is on my SD card. This is driving me nuts. I assume Word cannot be wrong here, so what could possibly be wrong with my code below?

Could white space or special characters in the file be causing the problem? Is there a way to clean the file to verify this?

        //Find the directory for the SD Card using the API
        File sdcard = Environment.getExternalStorageDirectory();

        //Get the text file
        File file = new File(sdcard,TEMP_FILE);

        //Read text from file
        //StringBuilder text = new StringBuilder();
        m_tree = new Tree();
        int i=0;
        BufferedReader br = null;
        long the_counter=0;
        try {
            br = new BufferedReader(new FileReader(file));
            String line;
            String []arLine;
            while ((line = br.readLine()) != null) {
                //get each word in line
                if(line.length()==0)
                    continue;
                arLine = line.split("\\s+");

                //now add each word to search tree
                for(i=0;i< arLine.length;++i){
                    m_tree.insert(arLine[i]);
                    if(arLine[i].equalsIgnoreCase("a"))
                        ++the_counter;
                }
            }
            m_sTest = Long.toString(the_counter);
            br.close();

I edited my code to read each line character by character and build the words manually, and I STILL GET THE SAME RESULT.

            br = new BufferedReader(new FileReader(file));
            String line;
            String []arLine;
            StringBuilder word = new StringBuilder();
            while ((line = br.readLine()) != null) {
                //check for word at end of last line
                if(word.length()>0){
                    m_tree.insert(word.toString());
                    word.setLength(0);
                }
                char[] lineChars = new char [line.length()];
                line.getChars(0,line.length(),lineChars,0);

                for(char c: lineChars){
                    if(c== ' '){
                        //if we have a word then store and clear then move on
                        if(word.length()>0){
                            m_tree.insert(word.toString());
                            word.setLength(0);
                        }
                    }
                    else{
                        word.append(c);
                    }
                }
Mike6679
  • anyone have any clue at all? – Mike6679 Jul 25 '14 at 12:19
  • What format is the file you are trying to read? If it is a Microsoft Word file, try testing your application with a plain text file instead. – Willie Nel Jul 25 '14 at 13:07
  • It is a plain text file – Mike6679 Jul 25 '14 at 13:08
  • Okay. Perhaps try performing a trim() on the String elements before comparing them? if(arLine[i].trim().equalsIgnoreCase("a")) – Willie Nel Jul 25 '14 at 13:11
  • I thought you were on to something there, but it did not work. Hmmmm – Mike6679 Jul 25 '14 at 13:20
  • Perhaps you could consider chopping your file in half repeatedly until you find the half where the wrong result is achieved? Or you could modify your code to create a copy with the running count (inserted) and then transfer that and open it in Word. If you start at the end and move back, deleting everything that follows, you'll find the point where the (inserted) count disagrees with Word's live one. – Chris Stratton Jul 25 '14 at 14:56
  • @Chris is my code for reading each word acceptable? I used it many times before with no issue. – Mike6679 Jul 25 '14 at 15:38

1 Answer


The issue was that I was not accounting for special characters between words, e.g. this-is-four-words is four words, not one. I'm not even sure that is proper grammar or writing, but it was in this file and it certainly threw off my count.
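For anyone hitting the same thing, here is a minimal sketch of one way to tokenize so that hyphens and punctuation separate words. The class name WordCounter, the method countWord, and the sample sentence are made up for illustration, and keeping apostrophes inside a token is my own assumption, so the count may still differ from Word's "whole word only" in edge cases.

```java
public class WordCounter {

    // Counts case-insensitive whole-word occurrences of target in text.
    // Tokens are split on any run of characters that is not a letter,
    // a digit, or an apostrophe, so "this-is-four-words" becomes four
    // tokens instead of one.
    static long countWord(String text, String target) {
        long count = 0;
        for (String token : text.split("[^\\p{L}\\p{N}']+")) {
            // A leading delimiter produces an empty first token;
            // it never equals the target, so it is harmless here.
            if (token.equalsIgnoreCase(target)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample line, not taken from the real file.
        String sample = "A cat saw a-dog, and a bird.";
        System.out.println(countWord(sample, "a"));
    }
}
```

Splitting the same sample on "\\s+" leaves "a-dog," as a single token, so the "a" inside it is missed. In the loop from the question, the equivalent change would be to call the split with this pattern on each line instead of "\\s+".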

Mike6679