
I implemented a word count program in Java. The program takes a large file (in my tests, I used a 10 GB data file containing only numbers) and counts the number of times each 'word' appears - in this case, each word is a number (for example, 23723 might appear 243 times in the file).

Below is my implementation. I am looking for guidance on improving it, mainly with performance in mind, but with a few other things as well. Here are a few of the issues I wish to correct:

  1. Currently, the program is threaded and works properly. However, I pass a chunk of memory (500MB/NUM_THREADS) to each thread, and each thread proceeds to word count it. The problem is that the main thread waits for ALL the threads to complete before passing more data to them, so there is a period of time where a few threads wait and do nothing. It isn't too much of a problem, but I believe some sort of worker pool or executor service could solve it (I have not learned the syntax for this yet).

  2. The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even a StringBuilder had awful performance). Currently, I use the fact that I know the input is an integer and just store the temporary variables as an int, so no memory problems there. I want to be able to use some sort of delimiter, whether that delimiter is a space or several characters.

  3. I am using a global ConcurrentHashMap to store key-value pairs. For example, if a thread finds the number "24624", it searches for that number in the map. If it exists, it increases the value of that key by one. The value of each key at the end represents the number of occurrences of that key. So is this the proper design? Would I gain in performance by giving each thread its own hashmap and then merging them all at the end?

  4. Is there any other way of seeking through a file with an offset without using the class RandomAccessFile? This class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.

I am open to other possibilities as well; this is just what comes to mind.

Note: Splitting the file is not an option I want to explore, as I might be deploying this on a server where I should not be creating my own files, but if it would really be a performance boost, I might listen.

Other note: I am new to Java threading, as well as to Stack Overflow. Be gentle.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class BigCount2 {
        public static void main(String[] args) throws IOException, InterruptedException {
            int num = Integer.parseInt(args[0]); // number of threads
            String delimiterString = " ";
            ArrayList<Character> delim = new ArrayList<Character>();
            for (char c : delimiterString.toCharArray()) {
                delim.add(c);
            }
            int bytesToRead = 1024 * 1024 * 1024 / 2; // 500 MB, processed per outer pass
            bytesToRead -= bytesToRead % num;         // make it evenly divisible among threads
            int byr = bytesToRead / num;              // bytes per thread
            String filepath = "C:/Users/Daniel/Desktop/int-dataset-10g.dat";
            RandomAccessFile file = new RandomAccessFile(filepath, "r");
            Thread[] t = new Thread[num]; // array of threads
            ConcurrentMap<Integer, Integer> wordCountMap = new ConcurrentHashMap<Integer, Integer>(25000);
            byte[] byteArray = new byte[byr]; // read buffer: one thread's slice of the 500 MB chunk
            char[] newbyte;
            for (long i = 0; i < file.length(); i += bytesToRead) {
                int counter = 0;
                for (long j = 0; j < bytesToRead; j += byr) {
                    file.seek(i + j);
                    file.read(byteArray, 0, byr);
                    newbyte = new String(byteArray).toCharArray();
                    t[counter] = new Thread(
                            new BigCountThread2(counter,
                                newbyte,
                                delim,
                                wordCountMap)); // each thread t[counter] gets its own slice
                    t[counter].start();
                    counter++;
                    newbyte = null;
                }
                for (int k = 0; k < num; k++) {
                    t[k].join(); // main thread continues only after ALL threads have finished
                }
                System.gc();
            }
            file.close();
        }
    }

    class BigCountThread2 implements Runnable {
        private final ConcurrentMap<Integer, Integer> wordCountMap;
        private final char[] newbyte;
        private final ArrayList<Character> delim;
        private final int threadId; // unused for now, kept for later
        BigCountThread2(int tid,
                char[] newbyte,
                ArrayList<Character> delim,
                ConcurrentMap<Integer, Integer> wordCountMap) {
            this.delim = delim;
            threadId = tid;
            this.wordCountMap = wordCountMap;
            this.newbyte = newbyte;
        }
        public void run() {
            int intbuilder = 0; // digits accumulated so far for the current word
            for (int i = 0; i < newbyte.length; i++) {
                int intCheck = Character.getNumericValue(newbyte[i]);
                if (newbyte[i] == ' ' || intCheck == -1) { // delimiter found: record the finished number
                    Integer check = wordCountMap.putIfAbsent(intbuilder, 1);
                    if (check != null) { // non-null means the key already existed, so bump its count
                        wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1);
                    }
                    intbuilder = 0;
                } else {
                    intbuilder = (intbuilder * 10) + intCheck;
                }
            }
        }
    }
  • You don't appear to check that your blocks don't split words in half. From my experience the bottleneck for large files is the speed you can read them from disk, not CPU. If you make the code efficient enough, one thread will do the job. Note: if you always read from file cache, e.g. repeatedly, the disk speed doesn't matter. – Peter Lawrey Jun 10 '14 at 06:14
  • Ah yes, I knew I forgot to mention something. That is very true; I was ignoring that case for the time being until I make a lot of other fixes. The bottleneck, I believe, is CPU if you are using an SSD (which I am). – DanGordon Jun 10 '14 at 06:55
  • I would write the simplest version of word count with one thread, one HashMap, and see how long the simple version takes. How else are you going to know how much of an improvement multiple threads provides? – Gilbert Le Blanc Jun 10 '14 at 07:06
  • Also, I would avoid using Strings for your buffers or split(); that will save you about half your CPU, though it will make the code more complicated. – Peter Lawrey Jun 10 '14 at 07:10
  • Please consider posting such a question on [Codereview](http://codereview.stackexchange.com/). I don't think your question is bad, it just seems more on topic for Codereview. – Viktor Seifert Jun 10 '14 at 12:29

1 Answer


Some thoughts on a little of each point ..

> .. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).

If all the threads take about the same time to process the same amount of data, then there really isn't that much of a "problem" here.

However, one nice thing about a thread pool is that it allows one to rather trivially adjust some basic parameters, such as the number of concurrent workers. Furthermore, using an executor service and Futures can provide an additional level of abstraction; in this case it would be especially handy if each thread returned a map as its result.
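
A minimal sketch of that shape, assuming a hypothetical countChunk method that carries the same digit loop as BigCountThread2 but fills a private HashMap (the pool size and the hard-coded chunks are illustrative only):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class PooledCount {

        // Hypothetical per-chunk counter: the same digit loop as BigCountThread2,
        // but it fills a private HashMap instead of a shared ConcurrentHashMap.
        static Map<Integer, Integer> countChunk(char[] chunk) {
            Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
            int builder = 0;
            for (char c : chunk) {
                int digit = Character.getNumericValue(c);
                if (digit == -1) { // delimiter: record the finished number
                    Integer old = counts.get(builder);
                    counts.put(builder, old == null ? 1 : old + 1);
                    builder = 0;
                } else {
                    builder = (builder * 10) + digit;
                }
            }
            return counts;
        }

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            List<Future<Map<Integer, Integer>>> futures =
                    new ArrayList<Future<Map<Integer, Integer>>>();

            // In the real program these would come from the file-reading loop.
            char[][] chunks = { "1 2 2 ".toCharArray(), "2 3 3 ".toCharArray() };
            for (final char[] chunk : chunks) {
                futures.add(pool.submit(new Callable<Map<Integer, Integer>>() {
                    public Map<Integer, Integer> call() {
                        return countChunk(chunk);
                    }
                }));
            }

            // Merge each per-thread result into a grand total, summing counts.
            Map<Integer, Integer> total = new HashMap<Integer, Integer>();
            for (Future<Map<Integer, Integer>> f : futures) {
                for (Map.Entry<Integer, Integer> e : f.get().entrySet()) {
                    Integer old = total.get(e.getKey());
                    total.put(e.getKey(), old == null ? e.getValue() : old + e.getValue());
                }
            }
            pool.shutdown();
            System.out.println(total); // {1=1, 2=3, 3=2}
        }
    }

Because each Callable returns its own HashMap, no locking is needed while counting, and the single-threaded merge at the end is cheap relative to the counting itself.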

> The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even StringBuilder had awful performance) ..

This sounds like an implementation issue. I would first try a StreamTokenizer (because it's already written); if doing it manually, I would check out its source - a good bit of it can be omitted once the notion of a "token" is simplified. (It uses a temporary array to build up each token.)
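
A rough sketch of the StreamTokenizer route (the file path is illustrative); by default it parses runs of digits as numeric tokens, which already removes the integers-only restriction:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.StreamTokenizer;
    import java.util.HashMap;
    import java.util.Map;

    public class TokenizerCount {
        public static void main(String[] args) throws IOException {
            Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
            BufferedReader reader = new BufferedReader(
                    new FileReader("int-dataset-10g.dat")); // path is illustrative
            try {
                StreamTokenizer tok = new StreamTokenizer(reader);
                while (tok.nextToken() != StreamTokenizer.TT_EOF) {
                    if (tok.ttype == StreamTokenizer.TT_NUMBER) {
                        int word = (int) tok.nval; // numeric tokens arrive as a double
                        Integer old = counts.get(word);
                        counts.put(word, old == null ? 1 : old + 1);
                    }
                    // non-numeric tokens would show up as TT_WORD with tok.sval
                }
            } finally {
                reader.close();
            }
            System.out.println(counts.size() + " distinct words");
        }
    }

One caveat: nval is a double, so very large integers lose precision - which is part of why writing the loop by hand can still win.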

> I am using a global ConcurrentHashMap to store key-value pairs. .. So is this the proper design? Would I gain in performance by giving each thread its own hashmap, and then merging them all at the end?

It would reduce locking, and may increase performance, to use a separate map per thread plus a merge step at the end. Furthermore, the current implementation is broken: wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1) is not atomic, so the operation might undercount. I would use a separate map simply because reducing mutable shared state makes a threaded program much easier to reason about.
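
(The merge itself is shown in the executor sketch above.) If you do stay with a single shared map, here is a sketch of one way to make the increment atomic on Java 7's ConcurrentMap, using putIfAbsent plus a replace() retry loop - a hypothetical helper, not part of the original code:

    import java.util.concurrent.ConcurrentMap;

    class Counts {
        // Atomically bumps the count for 'key'. replace(key, old, new) only
        // succeeds if the value is still 'old', so a lost race simply retries.
        static void increment(ConcurrentMap<Integer, Integer> map, int key) {
            for (;;) {
                Integer old = map.putIfAbsent(key, 1);
                if (old == null) {
                    return; // first occurrence of this word
                }
                if (map.replace(key, old, old + 1)) {
                    return; // no other thread changed it in between
                }
                // another thread won the race; loop and re-read the value
            }
        }
    }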

> Is there any other way of seeking through a file with an offset without using the class RandomAccessFile? This class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.

Consider using a FileReader (and BufferedReader) per thread on the same file. This avoids having to first copy the file into an array and slice it out for individual threads; it is the same amount of total reading, but it doesn't soak up nearly as much memory. The reading done is actually not random access, but merely sequential (with a "skip") starting from different offsets - each thread still works on a mutually exclusive range. A sketch of the setup follows.
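
Something like this for the per-thread setup (the class and method names are made up): skip the underlying FileReader to the thread's offset, then buffer from that point on.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    class WorkerSetup {
        // Each worker opens its own reader on the same file and skips to its
        // offset. Reader#skip may skip fewer chars than requested, hence the loop.
        static BufferedReader openAt(String path, long offset) throws IOException {
            FileReader fr = new FileReader(path);
            long remaining = offset;
            while (remaining > 0) {
                long skipped = fr.skip(remaining);
                if (skipped <= 0) {
                    break; // reached EOF before the offset
                }
                remaining -= skipped;
            }
            return new BufferedReader(fr);
        }
    }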

Also, the original code with the slicing is broken if an integer value is "cut" in half, as each of the two threads would read half the word. One workaround is to have each thread skip the first word if it is a continuation from the previous block (i.e. scan one byte sooner) and then read past the end of its range as required to complete the last word.
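
Here is a rough sketch of that rule, assuming the single-space delimiter from the question and a reader already positioned one character before the range start (the openAt sketch above would do); a word belongs to whichever thread's range contains its first character:

    import java.io.IOException;
    import java.io.Reader;
    import java.util.HashMap;
    import java.util.Map;

    class RangeCounter {
        // Counts space-delimited numbers whose FIRST character lies inside
        // [offset, offset + length). The reader must already be positioned at
        // offset - 1 (or at 0 for the first block), e.g. via openAt() above.
        static Map<Integer, Integer> countRange(Reader reader, long offset, long length)
                throws IOException {
            Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
            long pos = (offset == 0) ? 0 : offset - 1; // index of the next char to read
            int c;
            if (offset > 0) {
                // Peek at the char just before our range: a digit means the first
                // word is a continuation that the previous thread will complete.
                c = reader.read();
                pos++;
                if (c != -1 && c != ' ') {
                    while ((c = reader.read()) != -1 && c != ' ') {
                        pos++; // skip the rest of the split word
                    }
                    pos++; // account for the delimiter that ended it
                }
            }
            long end = offset + length;
            int builder = 0;
            boolean inWord = false;
            while ((c = reader.read()) != -1) {
                pos++;
                if (c == ' ') {
                    if (inWord) { // a word just ended: count it
                        Integer old = counts.get(builder);
                        counts.put(builder, old == null ? 1 : old + 1);
                        builder = 0;
                        inWord = false;
                    }
                    if (pos >= end) {
                        break; // any later word starts in the next thread's range
                    }
                } else {
                    if (!inWord && pos > end) {
                        break; // this word starts at or past 'end': not ours
                    }
                    builder = (builder * 10) + Character.getNumericValue((char) c);
                    inWord = true;
                }
            }
            if (inWord) { // the file ended mid-word with no trailing delimiter
                Integer old = counts.get(builder);
                counts.put(builder, old == null ? 1 : old + 1);
            }
            return counts;
        }
    }

Each thread can then return its private counts map, and the maps are summed at the end as in the executor sketch.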

  • I thought about using a `BufferedReader` per thread - but I'm not sure how that would work. Each `BufferedReader` would be pointing to the beginning of the file (in the beginning), which means each thread would be grabbing the same data. How can thread 1's `BufferedReader` read the first X bytes, and thread 2's `BufferedReader` read the next X + X bytes? I don't remember being able to offset when using `BufferedReader`. You are correct about the put/get not being atomic; is there a way to fix that without making each thread use its own map? – DanGordon Jun 10 '14 at 16:22
  • Also, how would I go about putting all the maps together efficiently? – DanGordon Jun 10 '14 at 16:23
  • @DanGordon Use [`Reader#skip`](http://docs.oracle.com/javase/7/docs/api/java/io/Reader.html#skip(long)) - in this case I'd likely tell the FileReader to skip, and then open the BufferedReader from that point. Then just keep track of how much has been read. – user2864740 Jun 10 '14 at 19:34
  • @DanGordon The easiest way to make something atomic is to wrap the entire construct in a `synchronized` block. It might be possible to employ an AtomicInteger in this case as well, but that sounds like too much work. – user2864740 Jun 10 '14 at 19:34
  • @DanGordon [`Map#putAll`](http://docs.oracle.com/javase/7/docs/api/java/util/Map.html#putAll(java.util.Map)) can be used to combine Maps, though for words present in both maps you'll need to add the counts yourself rather than let putAll overwrite them. Either way HashMaps will do it efficiently here - O(n). – user2864740 Jun 10 '14 at 19:36