Extra bytes appearing when building file data using multiple threads

Question

I am working on a large-scale dataset, and after building a model I use multithreading (the whole project is in Java) as follows:

OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile));

int i=0;
Collection<Track1Callable> callables = new ArrayList<Track1Callable>();

// For each entry in the test file, do whatever needs to be done.
// Track1Callable actually processes that entry and returns a double value.
for (Pair<PreferenceArray, long[]> tests : new DataFileIterable(
        KDDCupDataModel.getTestFile(dataFileDirectory))) {
    PreferenceArray userTest = tests.getFirst();
    callables.add(new Track1Callable(recommender, userTest));
    i++;
}

ExecutorService executor = Executors.newFixedThreadPool(cores); //24 cores
List<Future<byte[]>> results = executor.invokeAll(callables);
executor.shutdown();

for (Future<byte[]> result : results) {
    for (byte estimate : result.get()) {
        out.write(estimate);
    }
}
out.flush();
out.close();

As I receive the result from each callable, I write it to a file. Is the output written in exactly the order in which the initial list of Callables was built, even though some tasks complete before others? It seems it should be, but I am not sure.

Also, I expect a total of 6.2 million bytes to be written to the output file, but I get an additional 2000 bytes (yeah, for free). That messes up my submission, and I think it is caused by some concurrency issue. I tested this on a small dataset and it seems to work fine there (264 bytes expected and received).

Am I doing anything wrong with the Executor framework or Futures?

It shouldn't make any difference, but you can write a `byte[]` without looping over all of its bytes: `out.write(result.get(), 0, result.get().length)`. — Tim Sylvester, Apr 06 '11 at 00:45
If you know the size of each result beforehand (e.g., if it's always the same), you could add a test to check that, just to make sure there is no issue with your other code returning wrongly sized results. — subsub, Apr 07 '11 at 08:58
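
The two comments above can be combined into one small helper. This is only a sketch: `CheckedWriter`, `writeChecked` and `expectedLength` are hypothetical names, and the check only makes sense if every task is supposed to return the same number of bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.List;

class CheckedWriter {

    // Writes each result in a single call and fails fast on a wrongly sized one.
    // expectedLength is an assumption: it applies only if all results have the same size.
    static long writeChecked(OutputStream out, List<byte[]> results, int expectedLength)
            throws IOException {
        long total = 0;
        for (int i = 0; i < results.size(); i++) {
            byte[] estimates = results.get(i);
            if (estimates.length != expectedLength) {
                throw new IllegalStateException("Result " + i + " has " + estimates.length
                        + " bytes, expected " + expectedLength);
            }
            out.write(estimates); // same effect as write(estimates, 0, estimates.length)
            total += estimates.length;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        List<byte[]> results = Arrays.asList(new byte[] {1, 2}, new byte[] {3, 4});
        System.out.println(writeChecked(out, results, 2)); // prints 4
    }
}
```

If a callable ever returns an array of the wrong size, the exception message names the index of the offending task, which narrows down where the extra bytes come from.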

score 0 · Accepted Answer · answered Apr 06 '11 at 00:31

Q: Is the order the same as the one specified for the tasks? Yes.

From the API:

Returns: A list of Futures representing the tasks, in the same sequential order as produced by the iterator for the given task list. If the operation did not time out, each task will have completed. If it did time out, some of these tasks will not have completed.
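
A small experiment can confirm this ordering guarantee. The sketch below is not the asker's code: `InvokeAllOrderDemo` and `runTasks` are made-up names, and the sleep only serves to make later tasks tend to finish before earlier ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class InvokeAllOrderDemo {

    // Task i sleeps longer the smaller i is, so completion order differs
    // from submission order whenever more than one thread is used.
    static List<Integer> runTasks(int taskCount, int threads) throws Exception {
        List<Callable<Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            final int id = i;
            tasks.add(() -> {
                Thread.sleep((taskCount - id) * 5L);
                return id;
            });
        }
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Integer> ids = new ArrayList<>();
            // invokeAll blocks until all tasks are done and returns the
            // futures in the order of the task list.
            for (Future<Integer> future : executor.invokeAll(tasks)) {
                ids.add(future.get());
            }
            return ids;
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runTasks(8, 4)); // prints [0, 1, 2, 3, 4, 5, 6, 7]
    }
}
```

The ids come back in list order because the result list mirrors the task list, regardless of which thread finished first.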

As for the "extra" bytes: have you tried doing all of this in sequential order (i.e., without using an executor) and checking if you obtain different results? It seems that your problem is outside the code provided (and probably is not due to concurrency).

akappa

Thanks a lot. I have actually tried the sequential version, which works great. No issues there except it takes around 5 days to run. This is the only change I made since then. – dreamer13134 Apr 06 '11 at 00:35
That's strange. Are you sure the Callables don't interfere with one another (i.e., that they don't compete for a resource)? – akappa Apr 06 '11 at 00:40
Exactly. They do not. They just work on independent users and there is no shared resource. OK, the Recommender model is in fact a shared resource, but it is only used to obtain a preference for a {user-item} pair. Read-only. What's more, it's documented to be thread-safe. – dreamer13134 Apr 06 '11 at 00:52

score 0 · Answer 2 · answered Apr 06 '11 at 07:13

The order in which the callables are executed doesn't matter in the code you have here. You write the results in the order in which the futures are stored in the list. Even if they were executed in reverse order, the file would look the same, because your file writing is single-threaded.

I suspect your callables are interacting with each other, so you get different results depending on the number of cores you use. For example, you might be using SimpleDateFormat, which is not thread-safe.

I suggest you run this twice in the same program, with a dataset that completes in a short time: first with only one thread in the thread pool, and a second time with 24 threads. You should be able to compare the results of both runs with Arrays.equals(byte[], byte[]) and see that you get exactly the same results.
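
A minimal sketch of that comparison, under some assumptions: `estimate` is a hypothetical stand-in for whatever Track1Callable computes, and the output goes to an in-memory buffer instead of a file so that both runs can be compared directly.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ThreadCountComparison {

    // Hypothetical stand-in for Track1Callable: a pure function of the user id.
    static byte[] estimate(int userId) {
        return new byte[] {(byte) (userId % 100), (byte) (userId % 7)};
    }

    // Runs all tasks on a pool of the given size and collects the bytes
    // in task-list order, as the question's code does.
    static byte[] runAll(int users, int threads) throws Exception {
        List<Callable<byte[]>> tasks = new ArrayList<>();
        for (int u = 0; u < users; u++) {
            final int userId = u;
            tasks.add(() -> estimate(userId));
        }
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (Future<byte[]> result : executor.invokeAll(tasks)) {
                byte[] estimates = result.get();
                out.write(estimates, 0, estimates.length);
            }
            return out.toByteArray();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] single = runAll(1000, 1);
        byte[] multi = runAll(1000, 24);
        System.out.println(single.length + " vs " + multi.length
                + " bytes, identical: " + Arrays.equals(single, multi));
    }
}
```

With a pure `estimate` like this one, both runs produce 2000 identical bytes. If the real callables yield different lengths or contents between the two runs, the tasks share some mutable state.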
