9

I am working on a program that has about 400 input files and about 40 output files. It's simple: it reads each input file and generates a new, much bigger file based on an algorithm.

I'm using the read() method of BufferedReader:

String encoding = "ISO-8859-1";
FileInputStream fis = new FileInputStream(nextFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(fis, encoding));
char[] buffer = new char[8192];

To read the input files I'm using this:

private String getNextBlock() throws IOException {
    int n = reader.read(buffer, 0, buffer.length);
    if (n == -1) {
        return null;
    } else {
        return new String(buffer, 0, n);
    }
}

With each block I'm doing some checks (like looking for a certain string inside the block) and then writing it to a file:

BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("fileName"), encoding));

writer.write(textToWrite);
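
In other words, the overall loop looks roughly like this (the marker string is just an example):

String block;
while ((block = getNextBlock()) != null) {
    if (block.contains("someMarker")) { // "someMarker" is a hypothetical search string
        // ... apply the algorithm to produce textToWrite ...
    }
    writer.write(textToWrite);
}
writer.close();

(A marker that spans two 8 KB blocks would be missed by a per-block search like this.)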

The problem is that the whole process takes about 12 minutes. I'm trying to find something much faster. Does anyone have an idea for a better approach?

Thanks.

Simulant
CC.
  • Have you tried benchmarking different buffer sizes? – netbrain May 02 '11 at 08:13
  • Is the bottleneck in the file IO or in the algorithm you're using to combine the data? – scaganoff May 02 '11 at 08:16
  • @CC if my answer doesn't give you any speed improvements, you could always try to threadpool the read operation. Doing simultaneous reads could increase performance (but could also degrade it). – netbrain May 02 '11 at 08:17
  • What is the size of the files? What is the speed of the HDD? – ilalex May 02 '11 at 08:20
  • If you are reading/writing to a local drive (a network drive would be much slower), then for the process to take 12 minutes (say 6 of them reading) the files would have to total about 10 GB for reads and for writes, or about 25 MB per read and 250 MB per write on average. Does this sound right? If so, your disk is your limit. If not, then I/O is not your bottleneck. – Peter Lawrey May 02 '11 at 08:27
  • Check this question... http://stackoverflow.com/questions/5800361/quickest-way-to-read-text-file-line-by-line-in-java/5800452#5800452 – Aaron May 02 '11 at 09:25
  • Have you considered multithreading? If you have multiple cores you can speed up the process, but the HDD will always be a bottleneck. – anizzomc May 02 '11 at 10:10
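
For reference, a minimal sketch of the threadpool suggestion from these comments, assuming each input file can be converted independently (class and method names are hypothetical):

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelConvert {
    public static void main(String[] args) throws InterruptedException {
        File[] inputs = new File("inputDir").listFiles(); // "inputDir" is a placeholder
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (final File in : inputs) {
            pool.submit(new Runnable() {
                public void run() {
                    processFile(in); // the existing per-file read/convert/write logic
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void processFile(File in) {
        // read 'in', apply the algorithm, write the corresponding output file
    }
}

Whether this helps depends on whether the disk or the CPU is the bottleneck, as the comments point out.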

3 Answers

20

You should be able to find an answer here:

http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly

For the best Java read performance, there are four things to remember (a short sketch illustrating them follows the list):

  • Minimize I/O operations by reading an array at a time, not a byte at a time. An 8 KB array is a good size.

  • Minimize method calls by getting data an array at a time, not a byte at a time. Use array indexing to get at bytes in the array.

  • Minimize thread synchronization locks if you don't need thread safety. Either make fewer method calls to a thread-safe class, or use a non-thread-safe class like FileChannel and MappedByteBuffer.

  • Minimize data copying between the JVM/OS, internal buffers, and application arrays. Use FileChannel with memory mapping, or a direct or wrapped array ByteBuffer.
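
A short sketch illustrating these points on a single file (the file name is a placeholder):

import java.io.FileInputStream;
import java.io.IOException;

public class BlockRead {
    public static void main(String[] args) throws IOException {
        FileInputStream in = new FileInputStream("nextFile.dat"); // placeholder name
        try {
            byte[] buf = new byte[8192]; // an 8 KB array, as recommended above
            long sum = 0;
            int n;
            while ((n = in.read(buf)) != -1) { // one call fills the whole array
                for (int i = 0; i < n; i++) {  // array indexing, no per-byte read() calls
                    sum += buf[i] & 0xFF;
                }
            }
            System.out.println("byte sum: " + sum);
        } finally {
            in.close();
        }
    }
}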

netbrain
5

As you don't give many details, I can suggest you try using memory-mapped files:

FileInputStream f = new FileInputStream(fileName);
FileChannel ch = f.getChannel();
MappedByteBuffer mbb = ch.map(FileChannel.MapMode.READ_ONLY, 0L, ch.size());
while (mbb.hasRemaining()) {
    // Access the data using the mbb
}

It could be optimized further if you gave more detail about what kind of data your files contain.

EDIT

Where it says // Access the data using the mbb, you could decode your text:

String charsetName = "UTF-16"; // choose the appropriate charset
CharBuffer cb = Charset.forName(charsetName).decode(mbb);
String text = cb.toString();
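
Putting the mapping and the decoding together, a minimal end-to-end sketch (the file name is a placeholder, and the question's ISO-8859-1 charset is assumed):

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;

public class MappedRead {
    public static void main(String[] args) throws IOException {
        FileInputStream f = new FileInputStream("nextFile.txt"); // placeholder name
        FileChannel ch = f.getChannel();
        MappedByteBuffer mbb = ch.map(FileChannel.MapMode.READ_ONLY, 0L, ch.size());
        // One decode call over the whole mapped region, so no character
        // can be split between two reads.
        CharBuffer cb = Charset.forName("ISO-8859-1").decode(mbb);
        String text = cb.toString();
        System.out.println(text.length() + " chars read");
        ch.close();
        f.close();
    }
}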
Pih
  • The OP wants to read the file as text. You might like to include how you read a MappedByteBuffer with the default encoding (or a specific one like UTF-8). – Peter Lawrey May 02 '11 at 09:09
  • As he reads the mapped file as bytes, it doesn't matter what the encoding is. He will need to specify the encoding when building the String: String s = new String(mbb.array(), Charset.forName("UTF-8")), taking care whether the array is loaded; if it is not, it will be necessary to read using asCharBuffer() and also to know the size and content of the array. – Pih May 02 '11 at 09:30
  • Ah, but the devil is in the detail. ;) For example, you cannot decode a String where one byte of a character has been read but another has not. ;) I don't believe you can call `mbb.array()` on a MappedByteBuffer. – Peter Lawrey May 02 '11 at 09:34
  • Indeed about the mbb.array(), I missed this important detail. He will need to use the Charset.decode method; I will update my answer to use it. – Pih May 02 '11 at 09:42
  • +1: It's not simple to get right, so adding an example is useful. – Peter Lawrey May 02 '11 at 09:58
  • Be aware that once a file has been mapped, a number of operations on that file will fail until the mapping has been released (e.g. delete, truncating to a size less than the mapped area), but there is currently (until Java 10?) no way to release the mapping except waiting for the GC: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4724038 – manuc66 May 14 '17 at 08:57
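
Following up on Peter Lawrey's point about split characters: when a multi-byte encoding is decoded chunk by chunk, a CharsetDecoder can carry the leftover bytes of a split character between chunks. A sketch under assumed buffer sizes and file name:

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class ChunkedDecode {
    public static void main(String[] args) throws IOException {
        FileChannel ch = new FileInputStream("nextFile.txt").getChannel(); // placeholder
        CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
        ByteBuffer in = ByteBuffer.allocate(8192);
        CharBuffer out = CharBuffer.allocate(8192);
        while (ch.read(in) != -1) {
            in.flip();
            dec.decode(in, out, false); // false: more input may follow
            in.compact();               // keep the trailing bytes of a split character
            out.flip();
            System.out.print(out);      // use the decoded chunk
            out.clear();
        }
        in.flip();
        dec.decode(in, out, true);      // true: end of input
        dec.flush(out);
        out.flip();
        System.out.print(out);
        ch.close();
    }
}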
2

Mapped byte buffers are the fastest way:

FileInputStream f = new FileInputStream(name);
FileChannel ch = f.getChannel();
MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, 0L, ch.size());
final int SIZE = 8192; // block size for the bulk gets
byte[] barray = new byte[SIZE];
long checkSum = 0L;
int nGet;
while (mb.hasRemaining()) {
    nGet = Math.min(mb.remaining(), SIZE);
    mb.get(barray, 0, nGet); // one bulk get per block instead of per byte
    for (int i = 0; i < nGet; i++) {
        checkSum += barray[i];
    }
}