
I need to write many files on a VM, around 300,000 in total. The job that generates the files works fine today, but it takes 3 to 4 hours to finish.

How can I implement this with parallel threads?

Cœur
danillonc
  • You need to post some code. Most file I/O can be sped up enormously just by using a `BufferedOutputStream`. – user207421 Feb 04 '15 at 22:00

2 Answers


I have worked out a way you can benefit from multi-threading with a minimum of changes to your code.

import java.io.*;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Created by peter.lawrey on 30/01/15.
 */
public class ConcurrentFileWriter {
    private final ThreadPoolExecutor es;
    private final int maxQueueSize;

    public ConcurrentFileWriter() {
        this(4, 10000);
    }

    public ConcurrentFileWriter(int concurrency, int maxQueueSize) {
        this.maxQueueSize = maxQueueSize;
        es = (ThreadPoolExecutor) Executors.newFixedThreadPool(concurrency);
    }

    public OutputStream newFileOutputStream(final String filename) {
        return new ByteArrayOutputStream() {
            @Override
            public void close() throws IOException {
                super.close();
                final ByteArrayOutputStream baos = this;
                // Apply simple back-pressure: wait while the queue is over the limit.
                while (es.getQueue().size() > maxQueueSize)
                    try {
                        Thread.sleep(10);
                    } catch (InterruptedException e) {
                        throw new AssertionError(e);
                    }
                es.submit(new Runnable() {
                    public void run() {
                        try {
                            FileOutputStream fos = new FileOutputStream(filename);
                            fos.write(baos.toByteArray());
                            fos.close();
                        } catch (IOException ioe) {
                            System.err.println("Unable to write to " + filename);
                            ioe.printStackTrace();
                        }
                    }
                });
            }
        };
    }

    public PrintWriter newPrintWriter(String filename) {
        try {
            return new PrintWriter(new OutputStreamWriter(newFileOutputStream(filename), "UTF-8"));
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError(e);
        }
    }

    public void close() {
        es.shutdown();
        try {
            es.awaitTermination(2, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            e.printStackTrace();
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String... args) {
        long start = System.nanoTime();
        ConcurrentFileWriter cfw = new ConcurrentFileWriter();
        int files = 10000;
        for (int i = 0; i < files; i++) {
            PrintWriter pw = cfw.newPrintWriter("file-" + i);
            pw.println("Hello World");
            pw.close();
        }
        long mid = System.nanoTime();
        System.out.println("Waiting for files to be written");
        cfw.close();
        long end = System.nanoTime();
        System.out.printf("Took %.3f seconds to generate %,d files and %.3f seconds to write them to disk%n",
                (mid - start) / 1e9, files, (end - mid) / 1e9);
    }
}

On an SSD, this prints

Waiting for files to be written
Took 0.075 seconds to generate 10,000 files and 0.058 seconds to write them to disk

What this does is allow you to write single-threaded code as you do now; the actual writing to disk, however, is done as a background task.

Note: you have to call `close()` to wait for the files to actually be written to disk.


The problem with writing a huge number of files is that it is a lot of work for an HDD. Using multiple threads won't make your drive spin any faster. Each time you open and close a file, it costs about 2 IOs (I/O operations). If you have an HDD that supports 80 IOPS (I/Os per second), you can open and close about 40 files per second, i.e. about 2 hours for 300,000 files.

By comparison, if you use an SSD, you can get 80,000 IOPS, which is 1000x faster, and you might spend only 8 seconds opening and closing the files.

Once you have switched to SSD, it may be worth using multiple threads. A simple way to do this is to use the Stream API in Java 8.

You can do something like this

IntStream.range(0, 300000).parallel()
         .forEach(i -> createFile(i));
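For completeness, a minimal self-contained sketch of this approach; `createFile` is a hypothetical helper (not defined in the answer above), shown here writing one small file:

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.stream.IntStream;

public class ParallelFileDemo {
    // Hypothetical helper: writes one small file; adapt to your own content.
    static void createFile(int i) {
        try (FileWriter w = new FileWriter("demo-" + i)) {
            w.write("Hello World\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // parallel() runs createFile on the common ForkJoinPool's worker threads.
        IntStream.range(0, 1000).parallel().forEach(ParallelFileDemo::createFile);
    }
}
```

Note that on an HDD this parallelism buys little, for the IOPS reasons given above; it mainly helps on an SSD.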
Peter Lawrey
  • Thank you for the explanation about disk I/O operations, but I use Java 6, so this solution doesn't solve my problem. Currently I use recursive operations, which is a bad solution. I have googled ThreadPool examples that set a number of threads to execute in parallel, but the examples aren't clear. – danillonc Jan 30 '15 at 13:37
  • @danillonc Given that Java 7 is about to reach End of Public Updates, I suggest upgrading to Java 8. Learning how to do this in Java 6 will be harder (though multi-threading is unlikely to help with an HDD anyway). – Peter Lawrey Jan 30 '15 at 15:25
  • I can't upgrade my Java version because it isn't my project; it belongs to the enterprise. – danillonc Jan 30 '15 at 15:49
  • @danillonc So I assume you can't upgrade the drive either ;) – Peter Lawrey Jan 30 '15 at 15:55
  • Yeah, I can't upgrade either one. – danillonc Jan 30 '15 at 16:32
  • @danillonc Could you write a zip of the files you need? Without reducing the number of files actually written, there isn't much point in changing it. – Peter Lawrey Jan 30 '15 at 16:35
  • The main problem is the time it takes to write the files. I need to write all the files while reducing that time. – danillonc Jan 30 '15 at 16:44
  • @danillonc In which case, you probably need a faster drive (or maybe not; I am just guessing you have an HDD). Can you check what your disk subsystem is? – Peter Lawrey Jan 30 '15 at 16:47
  • @danillonc I think I have an answer for you which would allow you to see if multi-threading your disk access will help, with a minimum of changes to your code. – Peter Lawrey Jan 30 '15 at 17:08
  • Understood, but I really do need to change my code. Upgrading the disk drive is not possible. – danillonc Jan 30 '15 at 17:18
  • Thanks for your solution. Good demonstration, very helpful. – danillonc Jan 30 '15 at 17:40
  • @danillonc The main problems that should be noted with this solution are that (1) IOExceptions never propagate back to the calling code, and (2) it assumes that each file's contents, and indeed possibly the contents of all 300,000 files, fit into memory at once. – user207421 Feb 04 '15 at 22:03

You need one thread that feeds the files to be processed into a queue, and a thread pool that dequeues from the queue and writes the files. One way to do that is to use a simple producer-consumer pattern.

Here is an example: Multithreaded producer consumer in Java.
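A minimal sketch of that pattern, compatible with Java 6 (which the asker is restricted to); the `WriteTask` class and the poison-pill shutdown marker are illustrative choices, not part of a library:

```java
import java.io.FileWriter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumerDemo {
    // Illustrative work item: a filename plus the content to write.
    static class WriteTask {
        final String name, content;
        WriteTask(String name, String content) { this.name = name; this.content = content; }
    }

    // Poison pill: consumers exit when they dequeue this marker.
    static final WriteTask POISON = new WriteTask(null, null);

    public static void main(String[] args) throws InterruptedException {
        final BlockingQueue<WriteTask> queue = new ArrayBlockingQueue<WriteTask>(1000);
        int consumers = 4;
        Thread[] workers = new Thread[consumers];
        for (int t = 0; t < consumers; t++) {
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    try {
                        for (WriteTask task; (task = queue.take()) != POISON; ) {
                            FileWriter w = new FileWriter(task.name);
                            try { w.write(task.content); } finally { w.close(); }
                        }
                    } catch (Exception e) { e.printStackTrace(); }
                }
            });
            workers[t].start();
        }
        // Producer: a single thread generates the work items.
        for (int i = 0; i < 1000; i++)
            queue.put(new WriteTask("pc-file-" + i, "Hello World\n"));
        for (int t = 0; t < consumers; t++) queue.put(POISON); // one marker per worker
        for (Thread w : workers) w.join();                     // wait until all files are written
    }
}
```

The bounded queue gives natural back-pressure: the producer blocks on `put` when the consumers fall behind, so memory use stays limited.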

Anand Rajasekar