I am trying to read a 1,000,000-line CSV file in Java. I'm using the OpenCSV library, and it works fine on a smaller file of 30,000 lines, processing it in under half a second. But when I try to read the million-line file, it never finishes.
I tested to find where it actually stops: using my own version of binary search, I first tried reading 500k lines, then 250k, and so on. I found that it easily reads 145k lines in 0.5-0.7 sec, while 150k does not even finish.
I have searched SO thoroughly and found several solutions, which I employed in my code, such as using a BufferedReader, a BufferedInputStream, etc., but none of them solved it. It still fails somewhere between 145k and 150k lines.
This is the relevant portion of my code (replacing 150000 with 145000 is what makes the program execute in under 1 sec):
try {
    // BufferedInputStream bufferedInputStream = new BufferedInputStream(new FileInputStream("myFile.csv"));
    CSVReader csvReader = new CSVReader(new InputStreamReader(
            new BufferedInputStream(new FileInputStream("myFile.csv"), 8192 * 32)));
    try {
        int count = 0;
        String[] line;
        long timeStart = System.nanoTime();
        while ((line = csvReader.readNext()) != null) {
            count++;
            if (count >= 150000) {
                break;
            }
        }
        long timeEnd = System.nanoTime();
        System.out.println("Count: " + count);
        System.out.println("Time: " + (timeEnd - timeStart) * 1.0 / 1000000000 + " sec");
    } catch (IOException e) {
        e.printStackTrace();
    }
} catch (FileNotFoundException e) {
    System.out.println("File not found");
}
As you can see, I tried setting a bigger buffer size as well. I tried various combinations of Readers, InputStreams, etc., and nothing made a difference.
I'm wondering how I can do this. Is there a way to read, say, 100k lines at a time, and then continue reading the next 100k?
Also, I'm open to any other solution that does not necessarily use the OpenCSV library. I just picked it for its simplicity in parsing a CSV file.
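To clarify what I mean by "chunks", here is a rough sketch of the kind of thing I have in mind, using only java.io (the class name ChunkedCsvRead and method readChunk are just names I made up for this illustration, and the naive comma split doesn't handle quoted fields the way a real CSV parser would):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ChunkedCsvRead {

    // Read up to chunkSize lines from the reader and split each on commas.
    // An empty result means the end of the input was reached.
    static List<String[]> readChunk(BufferedReader reader, int chunkSize) throws IOException {
        List<String[]> rows = new ArrayList<>();
        String line;
        while (rows.size() < chunkSize && (line = reader.readLine()) != null) {
            rows.add(line.split(",", -1)); // naive: no quote or escape handling
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for new FileReader("myFile.csv") so the sketch is self-contained.
        Reader source = new StringReader("a,b\nc,d\ne,f\n");
        BufferedReader reader = new BufferedReader(source);

        int total = 0;
        List<String[]> chunk;
        while (!(chunk = readChunk(reader, 2)).isEmpty()) {
            total += chunk.size(); // each chunk would be processed here
        }
        System.out.println("Total rows: " + total); // prints "Total rows: 3"
    }
}
```

Something along these lines, but for 100k-line chunks of the real file, is what I'm asking about, unless there's a better-established way to do it.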