
I want to read a log file in chunks so that it can be processed by multiple threads. The application will run in a server-side environment with multiple hard disks. After the file is split into chunks, the app processes each chunk line by line.

I've managed to read the file line by line with a BufferedReader, and I can split the file into chunks with RandomAccessFile combined with MappedByteBuffer, but combining the two approaches isn't easy.

The problem is that a chunk boundary cuts through the last line of the chunk. I never have the complete last line of a block, so processing that last log line is impossible. I'm trying to find a way to cut the file into variable-length chunks that respect line endings.

Does anyone have code for doing this?

Yoni
  • It seems very unlikely indeed that reading a single file in multiple threads will be faster than reading it in a single thread. Disks are very good at sequential access, less so at random access. If the bottleneck is in processing rather than IO (again, seems unlikely), then read all the data in one thread, and hand blocks off to worker threads to be processed. I would suggest you limit the parallelism to processing multiple files at once, each with a single thread. – Tom Anderson Apr 01 '11 at 10:03
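Tom Anderson's single-reader suggestion can be sketched roughly as below. The file name, batch size, and the per-line work are placeholders I've chosen for illustration; the counter only exists to make the sketch observable.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class SingleReaderWorkers {
    static final int BATCH_SIZE = 1000;               // lines handed off per task
    static final AtomicLong processedLines = new AtomicLong();

    // One thread reads the file sequentially and hands batches of lines
    // to a pool of workers, so the disk only ever sees sequential reads.
    static void dispatch(Path file, ExecutorService workers) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == BATCH_SIZE) {
                    List<String> toProcess = batch;
                    workers.execute(() -> process(toProcess));
                    batch = new ArrayList<>(BATCH_SIZE);
                }
            }
            if (!batch.isEmpty()) {
                List<String> toProcess = batch;
                workers.execute(() -> process(toProcess));
            }
        }
    }

    static void process(List<String> lines) {
        // placeholder for the real per-line work
        processedLines.addAndGet(lines.size());
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        ExecutorService workers = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        dispatch(Path.of("your.file"), workers);
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

With this shape the workers never touch the disk at all, which matches the comment's point about sequential access.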

2 Answers


You could find offsets in the file that fall on line boundaries before you start processing the chunks. Start from the offset obtained by dividing the file size by the number of chunks, then scan forward until you reach a line boundary. Feed those offsets into your multi-threaded file processor. Here's a complete example that uses the number of available processors as the number of chunks:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReadFileByChunks {
    public static void main(String[] args) throws IOException {
        int chunks = Runtime.getRuntime().availableProcessors();
        long[] offsets = new long[chunks];
        File file = new File("your.file");

        // determine line boundaries for number of chunks
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        for (int i = 1; i < chunks; i++) {
            raf.seek(i * file.length() / chunks);

            while (true) {
                int read = raf.read();
                if (read == '\n' || read == -1) {
                    break;
                }
            }

            offsets[i] = raf.getFilePointer();
        }
        raf.close();

        // process each chunk using a thread for each one
        ExecutorService service = Executors.newFixedThreadPool(chunks);
        for (int i = 0; i < chunks; i++) {
            long start = offsets[i];
            long end = i < chunks - 1 ? offsets[i + 1] : file.length();
            service.execute(new FileProcessor(file, start, end));
        }
        service.shutdown();
        try {
            // optionally wait for all chunks to finish processing
            service.awaitTermination(1, java.util.concurrent.TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static class FileProcessor implements Runnable {
        private final File file;
        private final long start;
        private final long end;

        public FileProcessor(File file, long start, long end) {
            this.file = file;
            this.start = start;
            this.end = end;
        }

        public void run() {
            try {
                RandomAccessFile raf = new RandomAccessFile(file, "r");
                raf.seek(start);

                while (raf.getFilePointer() < end) {
                    // note: RandomAccessFile.readLine decodes bytes as ISO-8859-1
                    String line = raf.readLine();
                    if (line == null) {
                        break; // end of file reached
                    }

                    // do what you need per line here
                    System.out.println(line);
                }

                raf.close();
            } catch (IOException e) {
                // deal with exception
            }
        }
    }
}
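If your logs are UTF-8, keep in mind that `RandomAccessFile.readLine` decodes each byte as ISO-8859-1. A variant of the per-chunk worker that maps its chunk with MappedByteBuffer (the approach mentioned in the question) and decodes it explicitly might look like this sketch. It assumes `start` and `end` are line-aligned offsets like those computed above, and it decodes a whole chunk into memory at once, so each chunk must fit in the heap; the class and method names are mine, not from the answer.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class ChunkReader {
    // Decode one [start, end) chunk as UTF-8 and return its lines.
    // Assumes start and end both sit on line boundaries.
    static List<String> readChunk(File file, long start, long end) throws IOException {
        try (FileChannel channel = FileChannel.open(file.toPath(), StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, start, end - start);
            String chunk = StandardCharsets.UTF_8.decode(buffer).toString();
            List<String> lines = new ArrayList<>();
            BufferedReader reader = new BufferedReader(new StringReader(chunk));
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line); // per-line work would go here
            }
            return lines;
        }
    }
}
```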
WhiteFang34
  • @WhiteFang34 - Awesome. Just a quick question: if I want to chunk a file without cutting a word into the next chunk, what would be the best approach? I can chunk, but parts of words (letters) end up in the next chunk, and I want to avoid that. – parrotjack Nov 16 '19 at 21:16
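For the follow-up question about words, the same boundary-seeking trick from the answer works; just scan for whitespace instead of a newline. A hypothetical helper (the name is mine, and it assumes a single-byte-per-character encoding such as ASCII or ISO-8859-1):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class WordBoundary {
    // Move a proposed split offset forward to just past the next whitespace
    // byte, so no word is cut in half between chunks.
    static long alignToWordBoundary(RandomAccessFile raf, long proposed) throws IOException {
        raf.seek(proposed);
        int b;
        while ((b = raf.read()) != -1 && !Character.isWhitespace((char) b)) {
            // keep scanning until whitespace or end of file
        }
        return raf.getFilePointer();
    }
}
```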

You need to let your chunks overlap. If no line is longer than a block, then an overlap of one block is enough. Are you sure you need a multithreaded version? Is the performance of GNU grep not good enough?

The implementation of GNU grep has already solved the problem of lines that cross a chunk border. If the GPL doesn't bother you, you can probably borrow ideas and code from there. It is a very efficient single-threaded implementation.
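One common way to realize the overlap idea: give each thread a byte range, let it skip any partial line at the start of its range, and let it read past the end of its range to finish the last line it owns. Each line then belongs to exactly one chunk, namely the one in which it starts. A sketch under those assumptions (the method name is mine):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class OverlappingChunks {
    // Return the lines "owned" by the byte range [start, end): every line
    // whose first byte falls inside the range. The reader may run past
    // 'end' to finish the last owned line, which is the overlap.
    static List<String> ownedLines(RandomAccessFile raf, long start, long end)
            throws IOException {
        List<String> lines = new ArrayList<>();
        if (start == 0) {
            raf.seek(0);
        } else {
            // back up one byte and discard the line containing it, so we
            // begin at the first line that starts at or after 'start'
            raf.seek(start - 1);
            raf.readLine();
        }
        while (raf.getFilePointer() < end) {
            String line = raf.readLine();
            if (line == null) {
                break; // end of file
            }
            lines.add(line);
        }
        return lines;
    }
}
```

Because ownership is decided by where a line starts, two adjacent ranges never process the same line twice and never drop one, even when the split point lands mid-line.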

Klas Lindbäck
  • I'm assigned to this project and it has to be multithreaded, because there will be multiple files (>500 MB) read on a large scale and everything has to be as fast as possible. – Yoni Apr 01 '11 at 09:44
  • Can't you just give one file to each thread? That way the threads don't have to know about each other. If the servers are Linux/Unix, my first approach would be to spawn a GNU grep command for each file, because GNU grep is one of the fastest ways to search files. – Klas Lindbäck Apr 01 '11 at 12:35
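Spawning one grep per file from Java could look something like the sketch below, assuming `grep` is on the PATH; the pattern, class name, and method name are placeholders of mine.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class GrepRunner {
    // Run "grep <pattern> <file>" and collect the matching lines.
    static List<String> grep(String pattern, Path file)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder("grep", pattern, file.toString()).start();
        List<String> matches = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                matches.add(line);
            }
        }
        p.waitFor();
        return matches;
    }

    public static void main(String[] args) {
        // one grep process per file, all running concurrently
        for (String f : args) {
            new Thread(() -> {
                try {
                    grep("ERROR", Path.of(f)).forEach(System.out::println);
                } catch (IOException | InterruptedException e) {
                    e.printStackTrace();
                }
            }).start();
        }
    }
}
```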