I have a big text file (well above 1 GB) and I want to use Java to count the occurrences of a certain word in that file. The text in the file is written on a single line, so checking it line by line is not possible. What would be the best way to tackle this problem?
-
What is the problem you are facing with that? – Akash Yadav May 12 '12 at 05:37
-
I have tried to use BufferedReader to read the content line by line, but after realizing that there are actually no newline characters in between, I have to use an alternative. I just hope that the size of the file won't turn out to be a heavy burden for the Java program. – God_of_Thunder May 12 '12 at 05:59
-
You mean a text file of about 1 GB with no newline character at the end? If so, `readLine` will not work on it. You need to read it in chunks. – Ravinder Reddy May 12 '12 at 06:28
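A minimal sketch of that chunked approach (the buffer size, the `ChunkWordCount` name, and the boundary handling are my own assumptions, not from the comment; it counts non-overlapping raw substring matches, so "apple" inside "apples" would also count):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

public class ChunkWordCount {
    // Reads the input in fixed-size chunks, carrying the last
    // word.length() - 1 chars over to the next chunk so that a match
    // spanning a chunk boundary is not missed.
    static long countWord(Reader in, String word) {
        char[] buf = new char[64 * 1024];
        StringBuilder window = new StringBuilder();
        long count = 0;
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                window.append(buf, 0, n);
                int idx = 0;
                while ((idx = window.indexOf(word, idx)) != -1) {
                    count++;
                    idx += word.length();
                }
                // Keep only a tail too short to hold a full match,
                // so nothing already counted can be counted twice.
                int keep = Math.min(window.length(), word.length() - 1);
                window.delete(0, window.length() - keep);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

Because at most one chunk plus a short tail is in memory at any time, this works regardless of file size.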
4 Answers
You want to use Java's Scanner class to consume that huge file word by word. Call the useDelimiter(...) method once to configure how your words should be split (maybe just on whitespace), and afterwards loop over the file content using hasNext() and next().
For the counting itself, you can use a HashMap for simplicity.
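A minimal sketch of this approach (the class name and the whitespace delimiter are assumptions; since the asker only needs one word, it counts a single target rather than filling a HashMap):

```java
import java.util.Scanner;

public class ScannerCount {
    // Scanner streams one token at a time, so memory use stays flat
    // no matter how large the input is.
    static long count(Readable source, String target) {
        long count = 0;
        try (Scanner sc = new Scanner(source)) {
            sc.useDelimiter("\\s+"); // whitespace-separated "words"
            while (sc.hasNext()) {
                if (sc.next().equals(target)) {
                    count++;
                }
            }
        }
        return count;
    }
}
```

For the real file you would pass a `new FileReader("yourfile.txt")` (or construct the Scanner directly from a `File`) instead of the `Readable`; to count every word, replace the `equals` check with `map.merge(word, 1L, Long::sum)` on a HashMap.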

-
Actually I just need to count one word. This is not about some statistics. – God_of_Thunder May 12 '12 at 06:08
-
@Kazekage Gaara Have I ever asked you to feed me? No. Don't be bothered to comment if this is what you think. – God_of_Thunder May 13 '12 at 15:40
-
@Bananeweizen Thanks for the idea. I have managed to use Scanner to achieve what I need to do. – God_of_Thunder May 14 '12 at 09:06
You can use a slight variation of the Trie data structure. This DS is used to create a dictionary of words. For example, if you want to search for 'Stack', you can query the trie with 'Sta' and it will return all words starting with 'Sta'.
Now, for your problem, you can traverse the file word by word and put each word into the trie, with an additional 'count' field on every word. When you insert a word into the modified trie, you increment its 'count'. In the end you have counts for all the words in the trie.
I assume memory usage should not be too high, as most of the words in your 1 GB file are repeated. You only have to traverse the file once. And once you have this trie, you can search for more than one word without a performance penalty.
EDIT:
I have to agree with @Bananeweizen that a HashMap is also a good solution if you only need exact matches. So read word by word and put the words into a HashMap. The memory usage should be about the same as for the trie.
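A minimal sketch of the counting trie described above (class and method names are my own; it stores children in per-node HashMaps for simplicity rather than fixed arrays):

```java
import java.util.HashMap;
import java.util.Map;

public class CountingTrie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        int count; // times a word ending at this node was inserted
    }

    private final Node root = new Node();

    // Walk/extend the path for `word`, bumping the count at its end.
    void insert(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.count++;
    }

    // Returns how many times `word` was inserted (0 if never).
    int count(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) {
                return 0;
            }
        }
        return cur.count;
    }
}
```

Feeding it is then just `trie.insert(word)` for every token read from the file, followed by one `trie.count(target)` lookup at the end.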
You'll first need to sort the words so that they are in alphabetical order. There are a number of ways to do this after reading in the data and splitting it on spaces. You'd also want to remove special characters and punctuation marks prior to the sort.
Once sorted, all occurrences of the word you're targeting sit side by side, which makes the search an O(N) problem. At that point, you can loop through and compare each word until you find the first instance of your word, then keep looping and counting occurrences until you reach a different word.
At that point, you know there are no more instances of the word in your collection, and you can halt the search.
This particular search algorithm is O(N) in the worst case. If your word is "apple", the search is likely to complete much faster than if your word is "zebra".
There are other algorithms you can choose, depending on your exact needs.
I'm assuming by your question that this is a programming exercise and not an actual problem for work. If it's a problem for work, then this problem has already been solved countless times, and there are many search libraries out there for Java that will help you solve this problem, including tools in the Java standard library.
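The sort-then-scan idea above can be sketched as follows (the class name is mine; it assumes the words have already been read in and cleaned, and mirrors the early-exit behaviour described: "apple" finishes faster than "zebra"):

```java
import java.util.Arrays;

public class SortedCount {
    // After sorting, all copies of `target` form one contiguous run,
    // so a single pass can count them and stop past the end of the run.
    static int count(String[] words, String target) {
        Arrays.sort(words);
        int count = 0;
        for (String w : words) {
            int cmp = w.compareTo(target);
            if (cmp == 0) {
                count++;
            } else if (cmp > 0) {
                break; // sorted order: no later word can match
            }
        }
        return count;
    }
}
```

Note that holding and sorting every word of a 1 GB file in memory is itself a cost; the O(N log N) sort dominates the O(N) scan.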

-
Well, it is actually a problem at my work (I wish it were just an exercise). I just want a feasible solution, as I doubt whether the memory consumption during the run would be too large. This program is merely a tool to verify the results of other programs, so it will be executed on a normal desktop computer, not a server. – God_of_Thunder May 12 '12 at 06:05
-
It could slow the computer down a little, but as long as it has enough resources, and as long as the JVM has enough resources allocated to it, you should be fine. Still, I believe this algorithm would be much faster in C++, as each word could be assigned to a pointer. It's much faster to sort pointers to strings than the actual strings themselves... – jamesmortensen May 12 '12 at 06:35
-
Maybe it works better in C++, but efficiency is not really a concern here. All I need from this program is to check whether the layout in that file is what I want. So it will only be executed a couple of times, and then I have no further use for it. – God_of_Thunder May 13 '12 at 15:29
You can build a text index using an external tool, and after that you will be able to find the counts of different words in this index quickly. E.g. you can get Lucene to build such an index and then simply get the frequency of terms in it. There was a similar question, counting the word frequency in a lucene index, with links to articles and code examples.
-
There are much simpler, non-external solutions to this problem – Hunter McMillen May 12 '12 at 05:38