6

I am reading a 77 MB file inside a servlet; in the future this will be 150 GB. This file is not written using anything from the NIO package, it is just written using a BufferedWriter.

Now this is what I need to do.

  1. Read the file line by line. Each line is a "hash code" of a text. Split it into pieces of 3 characters (3 characters represent 1 word). A line could be long or short, I don't know in advance.

  2. After reading a line, convert it into real words. We have a Map of hashes to words, so we can look the words up (see the sketch below).
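
To make the requirement concrete, here is a minimal sketch of steps 1 and 2 for a single line (the class name, the sample hashes and the map contents below are made up for illustration, they are not my real data):

import java.util.HashMap;
import java.util.Map;

public class HashLineDecoder {
    public static void main(String[] args) {
        // Map of hash -> word; the entries here are invented for the example.
        Map<String, String> hashToWord = new HashMap<String, String>();
        hashToWord.put("a1b", "hello");
        hashToWord.put("c2d", "world");

        String line = "a1bc2d"; // one line read from the file
        StringBuilder sentence = new StringBuilder();
        for (int i = 0; i + 3 <= line.length(); i += 3) {
            String hash = line.substring(i, i + 3);  // 3 chars represent 1 word
            String word = hashToWord.get(hash);      // convert the hash back to the real word
            if (word != null) {
                sentence.append(word).append(' ');
            }
        }
        System.out.println(sentence.toString().trim()); // prints "hello world"
    }
}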

Up to now I have used a BufferedReader to read the file. It is slow and not good for huge files like 150 GB: it took hours to complete the entire process even for this 77 MB file. Because we can't keep the user waiting for hours, it should finish within seconds. So we decided to load the file into memory. First we thought about loading every single line into a LinkedList so it would all sit in memory, but of course memory cannot hold such a big amount of data. After a lot of searching, I decided that mapping the file into memory would be the answer: memory is much faster than disk, so we could read the file much faster too.

Code:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.logging.Level;
import java.util.logging.Logger;

public class MapRead {

    public MapRead()
    {
        try {
            File file = new File("E:/Amazon HashFile/Hash.txt");
            FileChannel c = new RandomAccessFile(file, "r").getChannel();

            // Map the whole file as read-only and ask the OS to load it into physical memory
            MappedByteBuffer buffer = c.map(FileChannel.MapMode.READ_ONLY, 0, c.size()).load();

            // Print the mapped content, one byte at a time
            for (int i = 0; i < buffer.limit(); i++)
            {
                System.out.println((char) buffer.get());
            }

            System.out.println(buffer.isLoaded());
            System.out.println(buffer.capacity());

            c.close();

        } catch (IOException ex) {
            Logger.getLogger(MapRead.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

}

But I could not see anything "super fast" about it, and I still need to read it line by line. I have a few questions to ask.

  1. You have read my description and you know what I need to do. I have done the first step for that; is it correct?

  2. Is the way I map the file correct? I mean, it seems no different from reading it the normal way. Does this hold the "entire" file in memory first (let's say using a technique called mapping), so that we then have to write other code to access that memory?

  3. How do I read it line by line, super "fast"? (If I have to load/map the entire file into memory first for hours and then access it at super speed within seconds, I am totally fine with that too.)

  4. Is reading files in servlets a good idea? (The servlet is accessed by a number of people, and only one I/O stream will be opened at once. In this case the servlet will be accessed by thousands at once.)

Update

This is how my code looks after I updated it based on SO user Luiggi Mendoza's answer.

public class BigFileProcessor implements Runnable {
    private final BlockingQueue<String> linesToProcess;
    public BigFileProcessor (BlockingQueue<String> linesToProcess) {
        this.linesToProcess = linesToProcess;
    }
    @Override
    public void run() {
        String line = "";
        try {
            while ( (line = linesToProcess.take()) != null) {

                System.out.println(line); //This is not happening
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}


public class BigFileReader implements Runnable {
    private final String fileName;
    int a = 0;

    private final BlockingQueue<String> linesRead;
    public BigFileReader(String fileName, BlockingQueue<String> linesRead) {
        this.fileName = fileName;
        this.linesRead = linesRead;
    }
    @Override
    public void run() {
        try {

            //Scanner did not work. I had to use BufferedReader
            BufferedReader br = new BufferedReader(new FileReader(new File("E:/Amazon HashFile/Hash.txt")));
            String str = "";

            while((str=br.readLine())!=null)
            {
               // System.out.println(a);
                a++;
            }

        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}



public class BigFileWholeProcessor {
    private static final int NUMBER_OF_THREADS = 2;
    public void processFile(String fileName) {

        BlockingQueue<String> fileContent = new LinkedBlockingQueue<String>();
        BigFileReader bigFileReader = new BigFileReader(fileName, fileContent);
        BigFileProcessor bigFileProcessor = new BigFileProcessor(fileContent);
        ExecutorService es = Executors.newFixedThreadPool(NUMBER_OF_THREADS);
        es.execute(bigFileReader);
        es.execute(bigFileProcessor);
        es.shutdown();
    }
}



public class Main {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        // TODO code application logic here
        BigFileWholeProcessor  b = new BigFileWholeProcessor ();
        b.processFile("E:/Amazon HashFile/Hash.txt");
    }
}

I am trying to print the file in BigFileProcessor. What I understood is this:

  1. The user enters the file name.

  2. That file gets read by BigFileReader, line by line.

  3. After each line, BigFileProcessor gets called. That is: assume BigFileReader reads the first line; BigFileProcessor is then called for it. Once BigFileProcessor completes the processing for that line, BigFileReader reads line 2, and BigFileProcessor gets called again for that line, and so on.

Maybe my understanding of this code is incorrect. How should I process the lines anyway?

PeakGen
  • By your actual implementation, you're never inserting the elements in the queue when reading them, so they cannot be taken in the processor. – Luiggi Mendoza Mar 05 '14 at 15:58
  • @LuiggiMendoza: oops! Any help please? I have never worked with stuff like this, first time things – PeakGen Mar 05 '14 at 16:18
  • That line is inside my code example: `linesRead.put(scanner.nextLine());`. You just have to do the same but using `BufferedReader`. Check the code of `BigFileReader#run` in my example one more time. – Luiggi Mendoza Mar 05 '14 at 16:25
  • @LuiggiMendoza: Thank you for the continuous help. I am marking your answer as the answer. I have one more question to ask. This `BlockingQueue` in `BigFileProcessor` store the data in heap right? so we can access it later at anytime right? Will this lead to `OutOfMemory` exception because the future files are in Gigabytes? – PeakGen Mar 06 '14 at 19:24
  • Yes, the data is stored in the heap. In order to avoid `OutOfMemoryError`, you can either use more threads that read from the queue, or halt the reading when the queue reaches a specific size (e.g. 10000) until it drops below e.g. 8000, then continue reading the file and filling the queue. Note that this strategy is also covered in my answer as explanation, not in code. – Luiggi Mendoza Mar 06 '14 at 19:43

3 Answers

6

I would suggest using multiple threads here:

  • One thread will take care of reading every line of the file and inserting it into a BlockingQueue to be processed.
  • One or more other threads will take the elements from this queue and process them.

To implement this multi-threaded work, it would be better to use the ExecutorService interface and pass it Runnable instances, one implementing each task. Remember to have only a single task reading the file.

You could also manage a way to stop reading if the queue reaches a specific size, e.g. if the queue has 10000 elements then wait until its size is down to 8000, then continue reading and filling the queue (see the sketch below).
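
A simpler variant of that idea, as a sketch: give the queue a capacity when you create it in BigFileWholeProcessor, so the reader's put() call blocks whenever the queue is full and resumes as soon as the processor has drained some elements (the capacity of 10000 is just an example value):

//bounded queue: put() blocks once 10000 elements are waiting,
//so the reading thread automatically pauses until the processors catch up
BlockingQueue<String> fileContent = new LinkedBlockingQueue<String>(10000);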

Is reading files in servlets a good idea?

I would recommend never doing heavy work in a servlet. Instead, fire an asynchronous task, e.g. via a JMS call, and process your file in that external agent.
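
As a rough sketch of that idea, here is one way to hand the work off from the servlet to a background executor instead of doing it in the request thread (the servlet name, the request parameter and the reuse of BigFileWholeProcessor are only assumptions for the example; a JMS queue or another messaging mechanism would be the more robust choice):

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class HashFileServlet extends HttpServlet {

    //a small pool shared by all requests; real sizing depends on your server
    private final ExecutorService backgroundWork = Executors.newFixedThreadPool(2);

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        final String fileName = req.getParameter("fileName");

        //fire the heavy work asynchronously and return to the user immediately
        backgroundWork.submit(new Runnable() {
            @Override
            public void run() {
                new BigFileWholeProcessor().processFile(fileName);
            }
        });

        resp.getWriter().println("File processing started");
    }

    @Override
    public void destroy() {
        backgroundWork.shutdown();
    }
}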


A brief sample of the above explanation to solve the problem:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.concurrent.BlockingQueue;

public class BigFileReader implements Runnable {
    private final String fileName;
    private final BlockingQueue<String> linesRead;
    public BigFileReader(String fileName, BlockingQueue<String> linesRead) {
        this.fileName = fileName;
        this.linesRead = linesRead;
    }
    @Override
    public void run() {
        //since it is a sample, I avoid managing how many lines you have read
        //and that stuff, but it should not be complicated to accomplish
        Scanner scanner = null;
        try {
            scanner = new Scanner(new File(fileName));
            while (scanner.hasNext()) {
                //put the line in the queue so the processor thread can take it
                linesRead.put(scanner.nextLine());
            }
        } catch (FileNotFoundException fnfe) {
            //handle the exception...
            fnfe.printStackTrace();
        } catch (InterruptedException ie) {
            //handle the exception...
            ie.printStackTrace();
        } finally {
            if (scanner != null) {
                scanner.close();
            }
        }
    }
}

import java.util.concurrent.BlockingQueue;

public class BigFileProcessor implements Runnable {
    private final BlockingQueue<String> linesToProcess;
    public BigFileProcessor (BlockingQueue<String> linesToProcess) {
        this.linesToProcess = linesToProcess;
    }
    @Override
    public void run() {
        String line = "";
        try {
            //take() blocks until a line is available; it never returns null,
            //so a real application would use a sentinel value to end this loop
            while ( (line = linesToProcess.take()) != null) {
                //do what you want/need to process this line...
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class BigFileWholeProcessor {
    private static final int NUMBER_OF_THREADS = 2;
    public void processFile(String fileName) {
        BlockingQueue<String> fileContent = new LinkedBlockingQueue<String>();
        BigFileReader bigFileReader = new BigFileReader(fileName, fileContent);
        BigFileProcessor bigFileProcessor = new BigFileProcessor(fileContent);
        ExecutorService es = Executors.newFixedThreadPool(NUMBER_OF_THREADS);
        es.execute(bigFileReader);
        es.execute(bigFileProcessor);
        es.shutdown();
    }
}
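
One caveat about this sample: BlockingQueue.take() never returns null, so the loop in BigFileProcessor will not end on its own once the whole file has been read. A common way to handle that (a sketch on top of the code above, not something required to get started) is a sentinel "poison pill" value that the reader puts on the queue when it is done, and that tells the processor to stop:

public class Sentinel {
    //a unique String instance used as an end-of-input marker
    public static final String POISON_PILL = new String("<EOF>");
}

//in BigFileReader.run(), after the reading loop finishes:
//    linesRead.put(Sentinel.POISON_PILL);

//in BigFileProcessor.run(), replace the while condition with:
//    String line;
//    while ((line = linesToProcess.take()) != Sentinel.POISON_PILL) {
//        //do what you want/need to process this line...
//    }
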
Luiggi Mendoza
  • Thanks for the reply. But how can I read a line here? I can't see "readLine()" method. And do you mean load everything first into the `BlockingQueue` and do the operations later? – PeakGen Mar 03 '14 at 17:35
  • @GloryOfSuccess you have [`BufferedReader#readLine`](http://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html#readLine()) or [`Scanner#nextLine`](http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html#nextLine()) to read each line of your file, and I mean to do it in parallel using threads. – Luiggi Mendoza Mar 03 '14 at 17:38
  • @GloryOfSuccess just to note, I've used `Scanner` to parse files of 500 MB and it takes a few seconds. – Luiggi Mendoza Mar 03 '14 at 17:40
  • @GloryOfSuccess: performance is also dependent on hardware used. Is it a decade-old 7200rpm IDE disk or a current SSD disk? – BalusC Mar 03 '14 at 17:56
  • @BalusC: Hi, it is an IDE drive. 5 years old. Normal spinning platter. – PeakGen Mar 03 '14 at 18:00
  • @LuiggiMendoza: Nice. can you just show me the `thread` thing please by editing my code? I did not get this properly, I tried this before in another way, so my mind runs there. – PeakGen Mar 03 '14 at 18:04
  • @LuiggiMendoza: I tried using `BufferedReader`. We can't load the `MappedByteBuffer` into this right? So, how can I read this? – PeakGen Mar 03 '14 at 18:16
  • @LuiggiMendoza: Hello? – PeakGen Mar 03 '14 at 18:56
  • @GloryOfSuccess answer updated with a sample of the exposed design. Note that I don't use any NIO class. – Luiggi Mendoza Mar 03 '14 at 19:59
  • Thanks a lot. But I noticed there are major compile issues in the code. – PeakGen Mar 05 '14 at 11:55
  • @GloryOfSuccess thanks, I just wrote the code out of my mind. Fixed the compilation errors. – Luiggi Mendoza Mar 05 '14 at 14:31
  • Thanks a lot for the reply. Err, I tried this just now; this is not exactly what I am seeking. You know, when a line is read in `BigFileReader`, that is the line which should be processed in `BigFileProcessor`. Any idea please? – PeakGen Mar 05 '14 at 14:57
  • @GloryOfSuccess I don't understand. That line will be inserted in the common queue from the BigFileReader, then it will be read again from the BigFileProcessor (this time from memory, not from disk, thus being faster) and you could process it as you wish. That's what it is doing. What's the exact problem? – Luiggi Mendoza Mar 05 '14 at 15:02
  • Really appreciate your reply. I can't explain here, so I update the post with new code, which includes your code advice. Please have a look. I appreciate your help. – PeakGen Mar 05 '14 at 15:53
2

NIO won't help you here. BufferedReader is not slow. If you're I/O bound, you're I/O bound -- get faster I/O.

Mapping the file into memory can help, but only if you're actually using the memory in place, rather than just copying all of the data out of the big byte buffer that you get back. The primary advantage of mapping the file is that it keeps the data out of the Java heap, and away from the garbage collector.

Your best performance will come from working on the data in place, and not copying it into the heap if you can.

Some of your performance may be impacted by object creation. For example, if you were trying to load your data into the LinkedList, you would be creating (likely) millions of nodes for the List itself, plus the objects wrapping your data (even if they're just Strings).

Creating Strings from your memory-mapped buffer can be quite efficient, since you decode only the region you need instead of pulling the whole file onto the heap. But you'll have to be charset (UTF) aware if you're working with something other than ASCII (as bytes are not characters in Java).
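
For example, one charset-aware way to turn a region of the mapped buffer into text (the start/end offsets are hypothetical; in practice you would find them by scanning the buffer for newline bytes):

import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.charset.StandardCharsets;

public class MappedLineDecoder {
    //decodes the bytes in [start, end) of the mapped buffer into a String using UTF-8
    static String decodeRegion(MappedByteBuffer mapped, int start, int end) {
        ByteBuffer view = mapped.duplicate(); //independent position/limit, same underlying data
        view.position(start);
        view.limit(end);
        return StandardCharsets.UTF_8.decode(view).toString();
    }
}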

Also, if you're loading large things with lots of objects, ensure that you have free space in your heap for them. And by free space, I mean actual room: you can have a 500 MB heap, as specified by -Xmx, but the ACTUAL heap will not be that large initially; it will grow to that limit.

Assuming you have sufficient memory in the first place, you can do this via -Xms, which will pre-allocate the heap to a desired size, or you can simply do a quick byte[] buf = new byte[400 * 1024 * 1024], to make a huge allocation, force the GC, and stretch the heap.

What you don't want to be doing is allocating a million objects and having the VM GC every 10000 or so as the heap grows. Pre-allocating other data structures is also helpful (notably ArrayLists; LinkedLists not so much).
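
For example (the numbers below are placeholders, pick values that match your data and machine):

//pre-size the heap on the command line so it doesn't have to grow in steps:
//    java -Xms512m -Xmx512m MyMainClass
//
//and pre-size collections that you know will get large:
List<String> lines = new ArrayList<String>(2000000); //rough guess at the number of lines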

Will Hartung
0

Divide the file into smaller parts. For this you'll need access to seekable reads, so you can fast-forward to other parts of the file.

For each part, spawn multiple worker threads, each with its own copy of the hash lookup table. Let completed threads join a collector thread, which will write the completed chunks in order and signal completion of the processing.

It would be better to stream the file chunks rather than loading all of them into memory.
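
A rough sketch of the seekable-read part (the chunk size handling is simplified, and chunk boundaries would still have to be aligned to line breaks before handing a chunk to a worker):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class ChunkReader {
    //reads one chunk of the file starting at the given offset, without touching what comes before it
    static ByteBuffer readChunk(String fileName, long offset, int chunkSize) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(fileName, "r");
        try {
            FileChannel channel = raf.getChannel();
            ByteBuffer chunk = ByteBuffer.allocate(chunkSize);
            channel.read(chunk, offset); //positional read: jumps straight to the offset
            chunk.flip();                //prepare the buffer to be consumed by a worker
            return chunk;
        } finally {
            raf.close();
        }
    }
}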

S.D.