
I want to read a huge CSV file. We generally use Super CSV to parse our files, but in this particular scenario the file is huge and we keep running out of memory, for obvious reasons.

The initial idea is to read the file in chunks, but I am not sure whether this would work with Super CSV, because when I split the file only the first chunk has the header row and can be loaded into the CSV bean; the other chunks have no header row, and I suspect that might throw an exception. So:

a) Is my thought process right?
b) Are there any other ways to approach this problem?

So my main question is:

Does Super CSV have the capability to handle large CSV files? I see that Super CSV reads the document through a BufferedReader, but I don't know what the buffer size is or whether we can change it to suit our requirements.
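
A minimal sketch, assuming Super CSV's CsvBeanReader accepts any Reader, of supplying a BufferedReader with a custom buffer size (the path and the 64 KB size are just placeholders):

import java.io.BufferedReader;
import java.io.FileReader;

import org.supercsv.io.CsvBeanReader;
import org.supercsv.io.ICsvBeanReader;
import org.supercsv.prefs.CsvPreference;

public class LargeBufferExample {
    public static void main(String[] args) throws Exception {
        // wrap the file in a BufferedReader whose buffer size we choose ourselves (64 KB here)
        BufferedReader reader = new BufferedReader(
                new FileReader("C:\\Blah\\largetextfile.txt"), 64 * 1024);
        ICsvBeanReader beanReader = new CsvBeanReader(reader, CsvPreference.STANDARD_PREFERENCE);
        // ... read beans one at a time as usual ...
        beanReader.close();
    }
}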

@Gilbert Le Blanc: I have tried splitting the file into smaller chunks as you suggested, but it is taking a long time to break the huge file down. Here is the code I have written to do it.

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.LineNumberReader;

public class TestFileSplit {

    public static void main(String[] args) {

        LineNumberReader lnr = null;
        try {
            File file = new File("C:\\Blah\\largetextfile.txt");
            lnr = new LineNumberReader(new FileReader(file), 1024);
            String line;
            String header = null;
            int noOfLines = 100000;
            int i = 1;
            File chunkDir = new File("C:\\Blah\\chunks");
            // proceed if the chunk directory already exists or was created successfully
            boolean chunkedFiles = chunkDir.isDirectory() || chunkDir.mkdir();
            if (chunkedFiles) {
                while ((line = lnr.readLine()) != null) {
                    if (lnr.getLineNumber() == 1) {
                        // remember the header row so it can be written to every chunk
                        header = line;
                        continue;
                    } else {
                        // a new chunk file is started for every 100000 records
                        if ((lnr.getLineNumber() % noOfLines) == 0) {
                            i = i + 1;
                        }

                        File chunkedFile = new File("C:\\Blah\\chunks\\"
                                + file.getName().substring(0, file.getName().indexOf(".")) + "_" + i + ".txt");

                        // if the chunk file does not exist, create it and write the header as the first row
                        if (!chunkedFile.exists()) {
                            chunkedFile.createNewFile();
                            FileWriter fw = new FileWriter(chunkedFile.getAbsoluteFile(), true);
                            BufferedWriter bw = new BufferedWriter(fw);
                            bw.write(header);
                            bw.newLine();
                            bw.close();
                            fw.close();
                        }

                        // append the current line; note the writer is reopened and closed for every line
                        FileWriter fw = new FileWriter(chunkedFile.getAbsoluteFile(), true);
                        BufferedWriter bw = new BufferedWriter(fw);
                        bw.write(line);
                        bw.newLine();
                        bw.close();
                        fw.close();
                    }
                }
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (lnr != null) {
                try {
                    lnr.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
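
For comparison, a minimal sketch of the same split that keeps a single BufferedWriter open per chunk instead of reopening the chunk file for every line, which is the most likely reason the split above is slow (paths and chunk size are reused from the code above, not the author's code):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class FasterFileSplit {
    public static void main(String[] args) throws IOException {
        int linesPerChunk = 100000;
        new File("C:\\Blah\\chunks").mkdirs();
        try (BufferedReader reader = new BufferedReader(new FileReader("C:\\Blah\\largetextfile.txt"))) {
            String header = reader.readLine();   // assumes the first line of the big file is the header
            String line;
            int lineCount = 0;
            int chunkIndex = 0;
            BufferedWriter writer = null;
            while ((line = reader.readLine()) != null) {
                if (lineCount % linesPerChunk == 0) {
                    if (writer != null) {
                        writer.close();          // finish the previous chunk
                    }
                    chunkIndex++;
                    writer = new BufferedWriter(new FileWriter(
                            "C:\\Blah\\chunks\\largetextfile_" + chunkIndex + ".txt"));
                    writer.write(header);        // every chunk starts with the header row
                    writer.newLine();
                }
                writer.write(line);
                writer.newLine();
                lineCount++;
            }
            if (writer != null) {
                writer.close();
            }
        }
    }
}
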
Nikhil Das Nomula
  • Make the first chunk the header values row. Then you can concatenate the first chunk with however many other chunks you need to make the huge file small enough, processing one chunk at a time. – Gilbert Le Blanc Sep 28 '12 at 19:20
  • I'm curious what's causing this - what are you doing with bean once it's been read? If you're adding to a List then you'll likely run out of memory. Is it possible to change your application architecture to process each bean as you read them - or to process them in small groups? – James Bassett Sep 28 '12 at 23:52

2 Answers


You can define the header in the parser Java class itself. That way you don't need a header row in the CSV files.

// only map the first 3 columns - setting header elements to null means those columns are ignored
final String[] header = new String[] { "customerNo", "firstName", "lastName", null, null, null, null, null, null, null };
beanReader.read(CustomerBean.class, header);
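
Expanded into a fuller sketch (CustomerBean is assumed to exist and the chunk path is a placeholder), a chunk file with no header row could then be read like this:

import java.io.FileReader;

import org.supercsv.io.CsvBeanReader;
import org.supercsv.io.ICsvBeanReader;
import org.supercsv.prefs.CsvPreference;

public class HeaderlessChunkReader {
    public static void main(String[] args) throws Exception {
        // same idea as the snippet above: the name mapping lives in code,
        // nulls mean "ignore this column", and the chunk file needs no header row
        final String[] header = new String[] { "customerNo", "firstName", "lastName",
                null, null, null, null, null, null, null };
        ICsvBeanReader beanReader = new CsvBeanReader(
                new FileReader("C:\\Blah\\chunks\\largetextfile_2.txt"),
                CsvPreference.STANDARD_PREFERENCE);
        try {
            CustomerBean customer;
            // no getHeader() call, because there is no header line to skip
            while ((customer = beanReader.read(CustomerBean.class, header)) != null) {
                // handle one bean at a time here
            }
        } finally {
            beanReader.close();
        }
    }
}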

or

You can also use the Dozer extension of the Super CSV API.

YSK Prasad

I'm not sure what the question is. Reading one line at a time as a bean takes roughly constant memory. If you store all of the read objects at once, then yes, you run out of memory. But how is that Super CSV's fault?
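
To illustrate the point, a small sketch (CustomerBean and the file path are placeholders) that processes each bean as it is read instead of collecting everything into a list:

import java.io.FileReader;

import org.supercsv.io.CsvBeanReader;
import org.supercsv.io.ICsvBeanReader;
import org.supercsv.prefs.CsvPreference;

public class StreamingReadExample {
    public static void main(String[] args) throws Exception {
        ICsvBeanReader beanReader = new CsvBeanReader(
                new FileReader("C:\\Blah\\largetextfile.txt"), CsvPreference.STANDARD_PREFERENCE);
        try {
            String[] header = beanReader.getHeader(true); // consume the header row once
            CustomerBean customer;
            while ((customer = beanReader.read(CustomerBean.class, header)) != null) {
                // handle the bean here and let it go out of scope: memory use stays roughly constant
                // (collecting every bean into a List is what eventually exhausts the heap)
            }
        } finally {
            beanReader.close();
        }
    }
}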

Carlo V. Dango
  • Yes, that is not an issue. The way we read from a CSV through Super CSV is through the file reader. Now that the file is really big, I am running into out-of-memory issues. I am not saying it is Super CSV's fault. Maybe I was not clear earlier. Here is the reframed question: I have a CSV file of 180 GB and I am pretty sure that if I try to load it and feed it into Super CSV I will get an out-of-memory exception. Therefore, as Gilbert says, I am trying to chunk it into smaller files and then read them, but I am not sure how to go about that, e.g. how do I chunk the file into pieces of exactly 2 GB each? – Nikhil Das Nomula Oct 02 '12 at 14:06
  • No need to split the file up into smaller chunks. Here's what I would do. First, I'd read the header line and save it to a String[]. Then I'd read the file X bytes or X lines at a time, where X is an ideal size given your memory constraints. Then, for each chunk X, which is represented as a String, create a StringReader which you pass into a CsvReader constructor. Then continue parsing until the CsvReader's read() method returns null. After that, read the next X from the file, and continue the above until you're finished (see the sketch below). – Aquarelle Aug 09 '13 at 00:55
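
A rough sketch of the approach Aquarelle describes (CustomerBean, the path, and the chunk size are placeholders; it also assumes no quoted field contains an embedded newline, since the chunks are split on raw lines):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.StringReader;

import org.supercsv.io.CsvBeanReader;
import org.supercsv.io.ICsvBeanReader;
import org.supercsv.prefs.CsvPreference;

public class InMemoryChunkRead {
    public static void main(String[] args) throws Exception {
        int linesPerChunk = 100000; // tune this to the available memory
        try (BufferedReader reader = new BufferedReader(new FileReader("C:\\Blah\\largetextfile.txt"))) {
            // the header line is read once and its column names are reused for every chunk
            String[] header = reader.readLine().split(",");
            String line;
            while (true) {
                StringBuilder chunk = new StringBuilder();
                int count = 0;
                while (count < linesPerChunk && (line = reader.readLine()) != null) {
                    chunk.append(line).append('\n');
                    count++;
                }
                if (count == 0) {
                    break; // nothing left to read
                }
                // each in-memory chunk gets its own short-lived CsvBeanReader
                ICsvBeanReader beanReader = new CsvBeanReader(
                        new StringReader(chunk.toString()), CsvPreference.STANDARD_PREFERENCE);
                CustomerBean customer;
                while ((customer = beanReader.read(CustomerBean.class, header)) != null) {
                    // process one bean at a time here
                }
                beanReader.close();
            }
        }
    }
}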