
I am trying to read a 1,000,000-line CSV file in Java. I'm using the OpenCSV library, and it works fine on a smaller file of 30,000 lines, processing it in under half a second. But when I try to read the million-line file, it never finishes.

To find where it actually stops, I did my own version of a binary search: I first tried to read 500k lines, then 250k, and so on. It easily reads 145k lines in 0.5-0.7 sec, while 150k does not even finish.

I have searched SO thoroughly and found several solutions, which I employed in my code, such as using a BufferedReader, a BufferedInputStream, etc., but none of them solved it. It still fails between 145k and 150k lines.

This is the relevant portion of my code (swapping 150000 with 145000 is what causes the program to execute in <1 sec):

try {
    CSVReader csvReader = new CSVReader(new InputStreamReader(
            new BufferedInputStream(new FileInputStream("myFile.csv"), 8192 * 32)));
    try {
        int count = 0;
        String[] line;
        long timeStart = System.nanoTime();
        while ((line = csvReader.readNext()) != null) {
            count++;
            if (count >= 150000) {
                break;
            }
        }
        long timeEnd = System.nanoTime();
        System.out.println("Count: " + count);
        System.out.println("Time: " + (timeEnd - timeStart) * 1.0 / 1000000000 + " sec");
    } catch (IOException e) {
        e.printStackTrace();
    }
} catch (FileNotFoundException e) {
    System.out.println("File not found");
}

As you can see, I tried setting a bigger buffer size as well. I tried various combinations of Readers, InputStreams, etc., and nothing really made a difference.

How can I do this? Is there a way to read, say, 100k lines at a time and then continue with the next 100k?

Also, I'm open to any other solution that does not necessarily include the OpenCSV library. I just used that one for its simplicity to parse a csv file.

Damir Pasic
    When you say 'never finishes'.... What's it actually up to? Deadlock, out of memory etc... Use a debugger to see what its up to, or put some sysouts in the loop to see if it is still processing, but just very slowly?? – Adam May 06 '17 at 13:10
    Similar thing happened to me too, but later realized that it is sometimes getting stuck due to not enough heap space. Once I increased the java heap space, it went really fast. – A.D Aug 12 '21 at 02:31

3 Answers


I just had a look at the OpenCSV implementation; I can see nothing there that would explain such behaviour just because the file is large and contains a lot of records.

But OpenCSV is able to process multiline data; from the website:

Handling quoted entries with embedded carriage returns (ie entries that span multiple lines).

I think that in your case there is a record - somewhere around the 150,000th record - that contains a wrongly quoted entry. The default quote char is ". This might be a record like:

value,value,"badvalue,value
value,value,value,value

In this case, the parser that is used by OpenCSV is set to a pending state, meaning that the record to read continues on the next line. And the call to CSVReader.readNext() tries to read as many lines as are necessary to finish a csv record. If there is no match for the misplaced quote character, it will read and read and read until the buffers are exhausted or some other error occurs.
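Not part of the original answer, but for illustration: such a record can often be located without OpenCSV at all, by scanning for the first physical line with an odd number of " characters. This is only a heuristic sketch in plain Java - it assumes quotes inside fields are not escaped by doubling, so treat the result as a hint rather than a verdict:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class QuoteScan {
    // Returns the 1-based number of the first line containing an odd
    // count of '"' characters, or -1 if every line is balanced.
    static int firstUnbalancedLine(BufferedReader reader) throws IOException {
        String line;
        int lineNo = 0;
        while ((line = reader.readLine()) != null) {
            lineNo++;
            int quotes = 0;
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == '"') quotes++;
            }
            if (quotes % 2 != 0) return lineNo;
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        String csv = "a,b,c\nvalue,\"bad,value\nx,y,z\n";
        System.out.println(firstUnbalancedLine(new BufferedReader(new StringReader(csv))));
        // line 2 carries the single unmatched quote
    }
}
```

To run it against the real file, wrap the file in `Files.newBufferedReader(Paths.get("myFile.csv"))` instead of the StringReader.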

To find the record, you could read the file as you do now, count the records, and print out the current count. This will give you the number of the last valid record before it stops/hangs as it does now.

Then I would write a new program that just reads the file line by line (not using CSVParser, just plain lines) and skips the number of lines you know are good. Print about 10 lines from there and you have some data to analyse.
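That second program could be sketched like this in plain Java (the file name and the known-good line count in main are placeholders for your values):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class InspectLines {
    // Returns up to `count` raw lines that follow the first `skip` lines.
    static List<String> linesAfter(Path file, long skip, int count) throws IOException {
        List<String> result = new ArrayList<>();
        try (var reader = Files.newBufferedReader(file)) {
            for (long i = 0; i < skip; i++) {
                if (reader.readLine() == null) return result; // file shorter than skip
            }
            String line;
            while (result.size() < count && (line = reader.readLine()) != null) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // e.g. the last record that parsed cleanly was number 145000
        for (String line : linesAfter(Path.of("myFile.csv"), 145000, 10)) {
            System.out.println(line);
        }
    }
}
```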

P.J.Meisch
  • I concur with P.J. and Marat that the issue is with the data. If you want to keep using the same program as above, I would narrow down the number (145K works, so try 147K, then 148K, and so on) in a binary fashion to find when it starts taking forever. Then you can look at that line in the real file (and the lines above/below it) to see where the data becomes malformed. – Scott Conway May 08 '17 at 20:16

Maybe the problem is not the number of lines in the CSV file but its content. Maybe there is some data in the lines between 145k and 150k that causes your application to never finish.

You can check this by copying the first 145k lines of your file and pasting them repeatedly into a new CSV file until it has 1m lines. If your application can process this new file, then the problem is in the data, not in the line count.
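Such a test file could be generated with a short sketch like the one below (file names and counts are placeholders; it reads the source into memory once, which is fine for a file of this size):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class MakeTestFile {
    // Writes `totalLines` lines to `target`, cycling over the first
    // `goodLines` lines of `source`.
    static void writeRepeated(Path source, Path target, int goodLines, int totalLines) throws IOException {
        List<String> good = Files.readAllLines(source).subList(0, goodLines);
        try (BufferedWriter writer = Files.newBufferedWriter(target)) {
            for (int i = 0; i < totalLines; i++) {
                writer.write(good.get(i % good.size()));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // repeat the known-good first 145k lines until the copy has 1M lines
        writeRepeated(Path.of("myFile.csv"), Path.of("myFileTest.csv"), 145000, 1000000);
    }
}
```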

Marat Safin

The problem was discussed in the answer by P.J. Meisch. However, I found a good solution that was not brought up.

Use the withIgnoreQuotations method when building a parser to get around the problem with quotes.

The following example uses the CsvToBeanBuilder to parse a CSV file (located at filepath) into a list of beans. The values in the file are separated by tabs ('\t'), and the first line is a header line and is therefore skipped (so it is not parsed into a bean instance).

List<Bean> beans = new CsvToBeanBuilder<Bean>(new FileReader(filepath))
    .withIgnoreQuotations(true)
    .withSeparator('\t')
    .withSkipLines(1)
    .withType(Bean.class)
    .build()
    .parse();
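The Bean class itself is not shown in this answer. As a hypothetical example, a minimal bean with two columns might look like the POJO below (the field names are assumptions; for the CsvToBeanBuilder variant above you would additionally map columns with OpenCSV's @CsvBindByName or @CsvBindByPosition annotations):

```java
// A minimal POJO compatible with the setter calls used in the
// line-by-line example; field names/types are illustrative only.
public class Bean {
    private String value1;
    private String value2;

    public String getValue1() { return value1; }
    public void setValue1(String value1) { this.value1 = value1; }
    public String getValue2() { return value2; }
    public void setValue2(String value2) { this.value2 = value2; }
}
```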

This example parses the same CSV file line by line into the same bean type. Note that the bean has some number of properties, n.

List<Bean> beans = new ArrayList<Bean>();
Path path = Paths.get(filepath);

String[] line;

CSVParser parser = new CSVParserBuilder()
    .withSeparator('\t')
    .withIgnoreQuotations(true)
    .build();

CSVReader reader = new CSVReaderBuilder(Files.newBufferedReader(path))
    .withCSVParser(parser)
    .build();

try {
    reader.readNext(); // skip the header line

    while ((line = reader.readNext()) != null) {
        Bean bean = new Bean();
        bean.setValue1(line[0]);
        bean.setValue2(line[1]);
        ...
        bean.setValueN(line[n - 1]);
        beans.add(bean);
    }
} catch (CsvValidationException | IOException e) {
    e.printStackTrace();
}

Jason