I need to write many files in a VM, around 300,000 of them. The job that generates the files currently works fine, but it takes 3-4 hours to finish.
How can I implement this with parallel threads?
I have worked out a way you can benefit from multi-threading with a minimum of changes to your code.
import java.io.*;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Created by peter.lawrey on 30/01/15.
 */
public class ConcurrentFileWriter {
    private final ThreadPoolExecutor es;
    private final int maxQueueSize;

    public ConcurrentFileWriter() {
        this(4, 10000);
    }

    public ConcurrentFileWriter(int concurrency, int maxQueueSize) {
        this.maxQueueSize = maxQueueSize;
        es = (ThreadPoolExecutor) Executors.newFixedThreadPool(concurrency);
    }

    public OutputStream newFileOutputStream(final String filename) {
        // Buffer the file contents in memory; the actual disk write is
        // handed off to the thread pool when the stream is closed.
        return new ByteArrayOutputStream() {
            @Override
            public void close() throws IOException {
                super.close();
                final ByteArrayOutputStream baos = this;
                // Simple back-pressure: wait while the queue of pending
                // writes is too long, so memory use stays bounded.
                while (es.getQueue().size() > maxQueueSize) {
                    try {
                        Thread.sleep(10);
                    } catch (InterruptedException e) {
                        throw new AssertionError(e);
                    }
                }
                es.submit(new Runnable() {
                    public void run() {
                        try {
                            FileOutputStream fos = new FileOutputStream(filename);
                            fos.write(baos.toByteArray());
                            fos.close();
                        } catch (IOException ioe) {
                            System.err.println("Unable to write to " + filename);
                            ioe.printStackTrace();
                        }
                    }
                });
            }
        };
    }

    public PrintWriter newPrintWriter(String filename) {
        try {
            return new PrintWriter(new OutputStreamWriter(newFileOutputStream(filename), "UTF-8"));
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError(e);
        }
    }

    public void close() {
        // Wait for all queued writes to complete.
        es.shutdown();
        try {
            es.awaitTermination(2, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            e.printStackTrace();
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String... args) {
        long start = System.nanoTime();
        ConcurrentFileWriter cfw = new ConcurrentFileWriter();
        int files = 10000;
        for (int i = 0; i < files; i++) {
            PrintWriter pw = cfw.newPrintWriter("file-" + i);
            pw.println("Hello World");
            pw.close();
        }
        long mid = System.nanoTime();
        System.out.println("Waiting for files to be written");
        cfw.close();
        long end = System.nanoTime();
        System.out.printf("Took %.3f seconds to generate %,d files and %.3f seconds to write them to disk%n",
                (mid - start) / 1e9, files, (end - mid) / 1e9);
    }
}
On an SSD, this prints
Waiting for files to be written
Took 0.075 seconds to generate 10,000 files and 0.058 seconds to write them to disk
What this does is let you keep writing single-threaded code as you do now, while the actual writing to disk is done as a background task.
Note: you have to call close() to wait for the files to be actually written to disk.
The problem with writing a huge number of files is that this is a lot of work for an HDD. Using multiple threads won't make your drive spin any faster. Each time you open and close a file it costs about 2 IOs (IO operations). If you have an HDD that supports 80 IOPS (IOs Per Second), you can open and close 40 files per second, i.e. about 2 hours for 300,000 files.
By comparison, if you use an SSD, you can get 80,000 IOPS which is 1000x faster and you might only spend 8 seconds opening and closing files.
Once you have switched to SSD, it may be worth using multiple threads. A simple way to do this is to use the Stream API in Java 8.
You can do something like this
IntStream.range(0, 300000).parallel()
         .forEach(i -> createFile(i));
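As a runnable sketch of the Stream approach: `createFile` is a hypothetical helper (it is not defined in the snippet above), shown here writing small files into a temporary directory, with a smaller count so it finishes quickly.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class ParallelFileCreator {
    // Hypothetical helper: writes one small file into the given directory.
    static void createFile(Path dir, int i) {
        try {
            Files.write(dir.resolve("file-" + i),
                    ("Hello World " + i).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("parallel-files");
        // parallel() splits the range across the common ForkJoinPool,
        // so several files are written concurrently.
        IntStream.range(0, 1000).parallel()
                 .forEach(i -> createFile(dir, i));
        try (Stream<Path> s = Files.list(dir)) {
            System.out.println(s.count()); // 1000
        }
    }
}
```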
Alternatively, you could have one thread that feeds the files to be processed into a queue, and a thread pool that dequeues from the queue and writes the files. One way to do that is a simple producer-consumer.
Here is an example: Multithreaded producer consumer in java
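A minimal producer-consumer sketch along these lines, using a `BlockingQueue` and a fixed thread pool. The queue items here are just simulated filenames, and the poison-pill sentinel used to stop the consumers is an assumption of this sketch, not part of the answer above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ProducerConsumerDemo {
    public static void main(String[] args) throws InterruptedException {
        // A bounded queue gives natural back-pressure on the producer.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        final String POISON = "__DONE__"; // sentinel telling a consumer to stop
        final AtomicInteger written = new AtomicInteger();
        int consumers = 4;
        ExecutorService pool = Executors.newFixedThreadPool(consumers);

        // Consumers: dequeue "filenames"; real code would write each file here.
        for (int c = 0; c < consumers; c++) {
            pool.submit(() -> {
                try {
                    for (String name; !(name = queue.take()).equals(POISON); ) {
                        // real code would open and write the file named 'name'
                        written.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Single producer feeds the work queue.
        for (int i = 0; i < 10_000; i++)
            queue.put("file-" + i);
        // One poison pill per consumer so they all terminate.
        for (int c = 0; c < consumers; c++)
            queue.put(POISON);

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(written.get());
    }
}
```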