
I have to read a large CSV file with about 40,000 entries, each a date and a value. This is what I did:

TreeMap<LocalDateTime, Double> fi = new TreeMap<>();
CSVReader reader = new CSVReader(new FileReader(path), ';');
String[] nextLine;

while ((nextLine = reader.readNext()) != null) {
    fi.put(LocalDateTime.parse(nextLine[0], DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm")), Double.valueOf(nextLine[1]));
}

reader.close();

Reading from the file is really fast, but parsing into a LocalDateTime is really slow: the whole run takes about 9 minutes to complete. Any ideas on how to make it faster?

Some sample lines from my CSV file:

2015-01-01 15:30;3 
2015-01-01 15:45;5 
2015-01-01 16:00;5 
2015-01-01 16:15;3 
2015-01-01 16:30;4 
2015-01-01 16:45;5 
2015-01-01 17:00;4 
2015-01-01 17:15;3 
2015-01-01 17:30;5 
2015-01-01 17:45;4 
2015-01-01 18:00;4
Ole V.V.
Aikas

1 Answer


Try reusing the formatter rather than creating a new one on every loop iteration. As written, the pattern has to be parsed on every iteration:

// Build the formatter once, outside the loop.
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm");
while ((nextLine = reader.readNext()) != null) {
    fi.put(LocalDateTime.parse(nextLine[0], formatter), Double.valueOf(nextLine[1]));
}
Julian Goacher
  • Still too slow, same time. @Julian – Aikas Jun 10 '17 at 16:50
  • Which library are you using to parse the CSV? I'll see if I can reproduce it. – Julian Goacher Jun 10 '17 at 16:55
  • For reading the CSV I use http://opencsv.sourceforge.net/. Reading from the file is really fast: I can read each line and do a System.out.println instantly. But when I have to parse with LocalDateTime it becomes really slow. I think the main problem is there, but I don't know how to resolve it. @Julian – Aikas Jun 10 '17 at 16:58
  • What type of hardware are you running on? 9 minutes seems very slow. I've tried some tests with 40k lines using the same CSV lib. Using your original code, I can parse the file in ~500ms. If I make the change I suggested, that reduces to ~450ms (so not much of a difference). If I instead hand-parse the date, using String.substring and the LocalDateTime.of(...) method, then I can reduce that to ~250ms, so about half the original time. But all of these are sub-second times. Are you using a very old machine? – Julian Goacher Jun 10 '17 at 17:28
  • I am using an i7-7700HQ, 8 GB of DDR4 RAM, a GTX 1050 4 GB and a 120 GB SSD. I don't understand what is happening. If I only read the file it is really fast, but with the LocalDateTime parsing it takes about 9 minutes. @Julian – Aikas Jun 10 '17 at 17:37
  • Can you paste a sample of the data in the CSV file? (just a few lines) – Julian Goacher Jun 10 '17 at 17:39
  • `2015-01-01 15:30;3 2015-01-01 15:45;5 2015-01-01 16:00;5 2015-01-01 16:15;3 2015-01-01 16:30;4 2015-01-01 16:45;5 2015-01-01 17:00;4 2015-01-01 17:15;3 2015-01-01 17:30;5 2015-01-01 17:45;4 2015-01-01 18:00;4 ` – Aikas Jun 10 '17 at 17:41
  • I am sorry, I can't write line breaks. Each new line starts at the next "2015". @Julian – Aikas Jun 10 '17 at 17:48
  • It's a strange one @Aikas. I wondered whether the variety of data would have an impact, so I wrote a script to reproduce your type of data set (incrementing date values mapped to small, random integer values) but this doesn't have a large impact on the resulting times. You need to try and isolate what exactly is causing the delays. What happens if you remove the date and number parsing code, and instead just insert the same LocalDateTime and Double value instance into the TreeMap (but keep the CSV parsing code)? Does it speed up, or stay the same speed? – Julian Goacher Jun 10 '17 at 17:59
  • I did that: I just inserted the same LocalDateTime and Double value into the TreeMap inside the loop, and the execution was instant. @Julian – Aikas Jun 10 '17 at 18:10
  • Ok, so the problem is with parsing and/or the data set; it isn't anything odd about the TreeMap or the way you're running the code. But there must be something else about your environment that's affecting the execution. Your hardware specs are definitely fine, but is your system low on free memory? Have you tried parsing a subset of the input? Does 20k lines take 4.5 minutes? Does 10k lines take 2.25 minutes? Does 1k take ~14 seconds? If the relation isn't linear (e.g. it is fast for small numbers of lines but slows down dramatically for larger numbers of lines), then it could be a memory problem. – Julian Goacher Jun 10 '17 at 18:45
  • I ran some tests: 20k lines take about 2 minutes and 10k lines take about 40 seconds. I don't know whether it is a memory problem, because the task manager shows 4 GB free during the execution, and Eclipse only uses 600 MB of 1024 MB. @Julian – Aikas Jun 10 '17 at 18:59
  • Do you know what settings your JVM is running with? (e.g. initial heap size etc.) – Julian Goacher Jun 10 '17 at 19:10
  • I don't understand. In my eclipse.ini I have the following lines: -Xms256m -Xmx1024m, and at the bottom of my Eclipse window it says: heap size 1024M max. @Jajag – Aikas Jun 10 '17 at 19:20
  • Off the top of my head I can't remember where this is configured in Eclipse, but I think the settings in the ini file are for the Eclipse executable itself, and not for the runtimes which are spawned when running a project. You could try running the code directly from the command line (i.e. java -cp . test.ClassName) to see if that makes a difference to the run time. Or try running the code on another machine, just to confirm that the problem is with the machine you're testing on. Not sure what more I can suggest without knowing a lot more about your setup! – Julian Goacher Jun 10 '17 at 19:26
  • I changed the maximum heap size and the maximum thread number: I doubled the memory and tested with 200, 100 and 20 threads, and got no speed-up; the time is nearly the same. I think the problem is with the LocalDateTime parsing, but I don't know how to resolve it. – Aikas Jun 10 '17 at 19:51
  • If you really think the parsing is the issue then try replacing it with this code; it halved the execution time for me: – Julian Goacher Jun 10 '17 at 20:14
  • `String date = nextLine[0]; int year = Integer.parseInt(date.substring(0, 4)); int month = Integer.parseInt(date.substring(5, 7)); int day = Integer.parseInt(date.substring(8, 10)); int hour = Integer.parseInt(date.substring(11, 13)); int minute = Integer.parseInt(date.substring(14, 16)); LocalDateTime localDateTime = LocalDateTime.of(year, month, day, hour, minute);` – Julian Goacher Jun 10 '17 at 20:14
  • OK, I now think the speed problem is because I add to the TreeMap one entry at a time: when I execute your code without the put into the TreeMap it is really fast, but when I execute the put it is really, really slow. @Julian – Aikas Jun 10 '17 at 20:46
  • And if I use a String instead of a LocalDateTime, the execution time is nearly instant. – Aikas Jun 10 '17 at 20:57
  • @Aikas is there any particular reason why you use a TreeMap here? What happens if you change it to a HashMap? – Julian Goacher Jun 11 '17 at 18:05
  • Well, it seems the problem was caused by a toString call that shows the whole content of the HashMap/TreeMap, and the console can't handle that quickly. I removed the print and everything works fine. Thanks @Julian – Aikas Jun 11 '17 at 19:37
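
The diagnosis the thread arrives at can be sketched as follows. This is a minimal, hypothetical reconstruction, not the original poster's code: the class name `ParseTimingSketch` and the helpers `parseWithFormatter`, `parseByHand` and `load` are invented for illustration, and `String.split` stands in for opencsv so the example is self-contained. It reuses one formatter, shows the hand-parsing variant from the comments, and prints only a summary of the map rather than the whole map. The reported timings (about 40 seconds, 2 minutes and 9 minutes for 10k, 20k and 40k lines) grow faster than linearly, which would be consistent with printing a growing map inside the loop, although the thread never shows where the print call was.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch; names are illustrative and do not come from the thread.
class ParseTimingSketch {

    // Built once and reused; DateTimeFormatter is immutable and thread-safe.
    private static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm");

    // Parse with the shared formatter, as suggested in the answer.
    static LocalDateTime parseWithFormatter(String text) {
        return LocalDateTime.parse(text, FORMATTER);
    }

    // Hand-parse a fixed-width "uuuu-MM-dd HH:mm" string, as in the comments.
    static LocalDateTime parseByHand(String text) {
        int year = Integer.parseInt(text.substring(0, 4));
        int month = Integer.parseInt(text.substring(5, 7));
        int day = Integer.parseInt(text.substring(8, 10));
        int hour = Integer.parseInt(text.substring(11, 13));
        int minute = Integer.parseInt(text.substring(14, 16));
        return LocalDateTime.of(year, month, day, hour, minute);
    }

    // Load "date;value" lines into a TreeMap.
    // Assumption: String.split replaces the opencsv reader used in the question.
    static TreeMap<LocalDateTime, Double> load(List<String> lines) {
        TreeMap<LocalDateTime, Double> result = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(";");
            // trim() is defensive; the sample lines carry trailing spaces.
            result.put(parseByHand(fields[0]), Double.valueOf(fields[1].trim()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "2015-01-01 15:30;3 ",
                "2015-01-01 15:45;5 ",
                "2015-01-01 16:00;5 ");

        long start = System.nanoTime();
        TreeMap<LocalDateTime, Double> fi = load(lines);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;

        // Print a summary only. Printing the whole map (an implicit toString)
        // writes every entry to the console, which is what turned out to be slow.
        System.out.println(fi.size() + " entries loaded in " + elapsedMicros + " microseconds");
        System.out.println("first=" + fi.firstKey() + " last=" + fi.lastKey());
    }
}
```

Timing the load and any console output separately, as `main` does here, is a quick way to confirm which of the two dominates before changing the parsing code.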