
I have the following code:

import java.io.*;
import java.util.concurrent.*;

public class Example {
    public static void main(String[] args) {
        try {
            FileOutputStream fos = new FileOutputStream("1.dat");
            DataOutputStream dos = new DataOutputStream(fos);

            for (int i = 0; i < 200000; i++) {
                dos.writeInt(i);
            }
            dos.close();                        // two sample files created

            FileOutputStream fos1 = new FileOutputStream("2.dat");
            DataOutputStream dos1 = new DataOutputStream(fos1);

            for (int i = 200000; i < 400000; i++) {
                dos1.writeInt(i);
            }
            dos1.close();

            Exampless.createArray(200000); // create a shared array
            Exampless ex1 = new Exampless("1.dat");
            Exampless ex2 = new Exampless("2.dat");
            // executed in parallel to count the number of matches in the two files
            ExecutorService executor = Executors.newFixedThreadPool(2);
            long startTime = System.nanoTime();
            long endTime;
            Future<Integer> future1 = executor.submit(ex1);
            Future<Integer> future2 = executor.submit(ex2);
            int count1 = future1.get();
            int count2 = future2.get();
            endTime = System.nanoTime();
            long duration = endTime - startTime;
            System.out.println("duration with threads: " + duration);
            executor.shutdown();
            System.out.println("Matches: " + (count1 + count2));

            startTime = System.nanoTime();
            ex1.call();
            ex2.call();
            endTime = System.nanoTime();
            duration = endTime - startTime;
            System.out.println("duration without threads: " + duration);

        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
        }
    }
}

class Exampless implements Callable<Integer> {

    public static int[] arr = new int[20000];
    public String _name;

    public Exampless(String name) {
        this._name = name;
    }

    static void createArray(int z) {
        for (int i = z; i < z + 20000; i++) { // fill the shared array
            arr[i - z] = i;
        }
    }

    public Integer call() {
        try {
            int cnt = 0;
            FileInputStream fin = new FileInputStream(_name);
            // read the file and count the number of matches
            DataInputStream din = new DataInputStream(fin);
            for (int i = 0; i < 20000; i++) {
                int c = din.readInt();
                if (c == arr[i]) {
                    cnt++;
                }
            }
            din.close();
            return cnt;
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
        }
        return -1;
    }
}

Here I am trying to count the number of matches between an array and two files. Although I am running it on two threads, the code is not performing well, because:

(running it on single thread, file 1 + file 2 reading time) < (file 1 || file 2 reading time in multiple thread).

Can anyone help me solve this? (I have a 2-core CPU and the file size is approx. 1.5 GB.)

Cœur
Arpssss
  • @SurajChandran, most of the times. And truly no effect.:) Just run a test. – Arpssss Jul 31 '12 at 16:33
  • The files aren't 1.5GB, they are only ~80K. – Keith Randall Jul 31 '12 at 16:33
  • @KeithRandall, I just give sample usage. – Arpssss Jul 31 '12 at 16:36
  • @Arpssss, I have added timing to the code listing, hope you don't mind that. On my machine, the threaded version always runs faster than the sequential, although the difference is not much. Like 48297747 nanoseconds for the threaded and 78930159 nanoseconds without threads. – bpgergo Jul 31 '12 at 16:59
  • @bpgergo, Please increase the file size by increasing the for loop arguments. I just give an example. – Arpssss Jul 31 '12 at 17:10

2 Answers


In the first case you are reading one file sequentially, byte by byte, block by block. This is as fast as disk I/O can be, provided the file is not very fragmented. When you are done with the first file, the disk/OS finds the beginning of the second file and continues the efficient, linear reading of the disk.

In the second case you are constantly switching between the first and the second file, forcing the disk to seek back and forth. This extra seek time (approximately 10 ms per seek) is the root of your confusion.

Also note that disk access is essentially single-threaded and your task is I/O-bound, so there is no way splitting it across multiple threads could help, as long as you are reading from the same physical disk. Your approach could only be justified if:

  • each thread, besides reading from a file, also performed some CPU-intensive or blocking operations, slower by an order of magnitude than the I/O;

  • the files were on different physical drives (a different partition is not enough) or on some RAID configurations;

  • you were using an SSD drive.

Tomasz Nurkiewicz
    +1. This is a fundamental problem that many people don't understand: only increasing the limiting reagent is going to increase performance. – RedGreasel Jul 31 '12 at 16:53

You will not get any benefit from multithreading the reads from disk, as Tomasz pointed out. You may get some speed-up if you multithread the checks, i.e. load the data from the files into arrays sequentially and then let the threads do the checking in parallel. But considering the small size of your files (~80 KB) and the fact that you are just comparing ints, I doubt the performance improvement would be worth the effort.
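As a rough illustration of that split (a minimal sketch, not the asker's exact program: the `ParallelCheck` class, file names, and sizes are invented), the files can be read into memory sequentially and only the comparison handed to the thread pool:

```java
import java.io.*;
import java.util.concurrent.*;

public class ParallelCheck {
    static final int N = 20000;

    // Write N consecutive ints starting at 'start' (sample data only)
    static void writeFile(String name, int start) throws IOException {
        try (DataOutputStream dos = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(name)))) {
            for (int i = start; i < start + N; i++) dos.writeInt(i);
        }
    }

    // Sequential I/O: pull the whole file into an int[] in one readFully() call
    static int[] readAll(String name) throws IOException {
        byte[] bytes = new byte[N * 4];
        try (DataInputStream din = new DataInputStream(
                new BufferedInputStream(new FileInputStream(name)))) {
            din.readFully(bytes);
        }
        int[] out = new int[N];
        java.nio.ByteBuffer.wrap(bytes).asIntBuffer().get(out);
        return out;
    }

    // CPU work: count positions where the data equals the reference array
    static int count(int[] data, int[] arr) {
        int cnt = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == arr[i]) cnt++;
        }
        return cnt;
    }

    public static void main(String[] args) throws Exception {
        writeFile("a.dat", 0);
        writeFile("b.dat", 0);
        int[] arr = new int[N];
        for (int i = 0; i < N; i++) arr[i] = i;

        // Read both files completely before starting any checks
        int[] dataA = readAll("a.dat");
        int[] dataB = readAll("b.dat");

        // Only the in-memory comparison runs in parallel
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<Integer> fa = pool.submit(() -> count(dataA, arr));
        Future<Integer> fb = pool.submit(() -> count(dataB, arr));
        int total = fa.get() + fb.get();
        pool.shutdown();
        System.out.println("Matches: " + total); // prints "Matches: 40000"
    }
}
```

This way the disk head never seeks between the two files, and the threads contend only for CPU, not I/O.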

Something that will definitely improve your execution speed is to avoid readInt(). Since you know you are comparing 20000 ints, you should read all 20000 ints into an array at once for each file (or at least in blocks), rather than calling readInt() 20000 times.
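A minimal sketch of that bulk read (the `BulkRead` class and file name are invented for illustration): readFully() grabs the whole file in one call, and a ByteBuffer view decodes it into ints with the same big-endian layout that DataOutputStream.writeInt() produced:

```java
import java.io.*;
import java.nio.ByteBuffer;

public class BulkRead {

    // Read n big-endian ints from the file with one readFully() call
    // instead of n separate readInt() calls
    static int[] readInts(String name, int n) throws IOException {
        byte[] bytes = new byte[n * 4];
        try (DataInputStream din = new DataInputStream(
                new BufferedInputStream(new FileInputStream(name)))) {
            din.readFully(bytes);
        }
        int[] out = new int[n];
        // ByteBuffer defaults to big-endian, matching DataOutputStream
        ByteBuffer.wrap(bytes).asIntBuffer().get(out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        final int n = 20000;
        // Write a sample file the same way the question's code does
        try (DataOutputStream dos = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream("bulk.dat")))) {
            for (int i = 0; i < n; i++) {
                dos.writeInt(i);
            }
        }
        int[] ints = readInts("bulk.dat", n);
        System.out.println(ints[0] + " .. " + ints[n - 1]); // prints "0 .. 19999"
    }
}
```

With the data in an int[], the matching loop then runs against memory rather than making a JNI-backed stream call per element.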

onit