
Suppose you have this .csv that we'll name "toComplete":

[Date,stock1, stock2, ...., stockn]
[30-jun-2015,"NA", "NA", ...., "NA"]
....
[30-Jun-1994,"NA","NA",....,"NA"]

with n = 1000 and 5000 rows; each row is for a different date. That's a fairly big file and I'm not used to working with files that size. My goal is to fill in the "NA" values with values I'll take from other .csv files. In fact, I have one file (also a .csv) for each stock, which means I have 1000 stock files plus my file "toComplete".

Here is what the stock files look like:

[Date, value1, value2]
[27-Jun-2015, v1, v2]
....
[14-Fev-2013,z1,z2]

There are fewer dates in each stock's file than in the "toComplete" file, and each date in a stock's file necessarily appears in "toComplete"'s file.

My question is: what is the best way to fill my file "toComplete"? I tried reading it line by line, but this is very slow: for each line of "toComplete" I read through all 1000 stock files to fill in that line. I think there are better solutions, but I can't see them.

EDIT: For example, to replace the "NA" in the second row and second column of "toComplete", I need to open my file stock1 and read it line by line to find the value of value1 corresponding to the date of the second row in "toComplete". I hope it makes more sense now.

EDIT2: Dates are edited. For a lot of stocks, I won't have values. In this example, we only have dates from 14-Fev-2013 to 27-Jun-2015, which means some "NA" values will remain at the end (but that's not a problem). I know which files to search because my files are named stock1.csv, stock2.csv, ... I put them all in a single directory so I can use the .list() method.
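That listing step can be sketched like this (a minimal example; the directory name "stocks" and the class name are assumptions, not from the original post):

```java
import java.io.File;

public class ListStockFiles {
    public static void main(String[] args) {
        // "stocks" is an assumed directory holding stock1.csv ... stock1000.csv
        File dir = new File("stocks");
        // File.list with a filter keeps only the CSV files
        String[] names = dir.list((d, name) -> name.endsWith(".csv"));
        if (names != null) {            // list() returns null if the directory doesn't exist
            for (String name : names) {
                System.out.println(name);
            }
        }
    }
}
```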

zardlemalefique

1 Answer


So you have 1000 "price history" CSV files for certain stocks, containing up to 5000 days of price history each, and you want to combine the data from those files into one CSV file where each line starts with a date and the rest of the entries on the line are the up to 1000 different stock prices for that historical day?

Back-of-the-napkin calculations indicate the final file would contain on the order of 100 MB of data (less than 20 bytes per stock price means less than 20 KB per line × 5000 lines). A JVM with a suitably sized heap should have plenty of room to read the data you want to keep from those 1000 files into a Map where the keys are the dates and the value for each key is another Map with up to 1000 stock-symbol keys and their stock values. Then write out your final file by iterating the Map(s).
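A minimal sketch of that approach, assuming each stock file has a header line and rows of the form Date,value1,value2, and that only value1 is copied into "toComplete" (the class and method names are mine, not from the answer):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.*;

public class CompleteTable {

    // Read one stock CSV (header "Date,value1,value2") into a map: date -> value1.
    static Map<String, String> readStockFile(Path file) throws IOException {
        Map<String, String> byDate = new HashMap<>();
        List<String> lines = Files.readAllLines(file);
        for (int i = 1; i < lines.size(); i++) {          // skip the header line
            String[] cols = lines.get(i).split(",");
            byDate.put(cols[0].trim(), cols[1].trim());
        }
        return byDate;
    }

    // Combine per-stock maps (stock name -> (date -> value))
    // into one table: date -> (stock name -> value).
    static Map<String, Map<String, String>> merge(Map<String, Map<String, String>> perStock) {
        Map<String, Map<String, String>> table = new HashMap<>();
        for (Map.Entry<String, Map<String, String>> stock : perStock.entrySet()) {
            for (Map.Entry<String, String> row : stock.getValue().entrySet()) {
                table.computeIfAbsent(row.getKey(), k -> new HashMap<>())
                     .put(stock.getKey(), row.getValue());
            }
        }
        return table;
    }

    // Render the combined table; dates a stock has no value for stay "NA".
    static String toCsv(List<String> dates, List<String> stocks,
                        Map<String, Map<String, String>> table) {
        StringBuilder out = new StringBuilder("Date," + String.join(",", stocks) + "\n");
        for (String date : dates) {
            out.append(date);
            Map<String, String> row = table.getOrDefault(date, Collections.emptyMap());
            for (String stock : stocks) {
                out.append(",").append(row.getOrDefault(stock, "NA"));
            }
            out.append("\n");
        }
        return out.toString();
    }
}
```

readStockFile would be called once per file returned by .list(), the resulting per-stock maps fed to merge, and the table written out with toCsv, so each stock file is read exactly once instead of once per line of "toComplete".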

geneSummons
  • I didn't even know it was possible to put a Map as a value for a specific key. Thank you very much, I'm going to do some research and try it! – zardlemalefique Jan 29 '16 at 00:33
  • `Map<String, Map<String, String>> map = new HashMap<String, Map<String, String>>();` or `Map<Date, Map<String, Double>> map = new HashMap<Date, Map<String, Double>>();` – geneSummons Jan 29 '16 at 00:34
  • @geneSummons: It's better to edit your answer to include the example code than to put it in a comment. – Eric J. Jan 29 '16 at 00:55
  • @EricJ: I will remember that next time, it's past 5 minutes now. I was responding to OP's comment with another comment. – geneSummons Jan 29 '16 at 01:14
  • Okay, so I created a HashSet to store all my dates, converted them from String to LocalDate, and sorted them with Collections.sort(). Now I should create my maps, but I don't know how to proceed. Is it better to create 1000 maps (one for each stock's file) or not? @geneSummons – zardlemalefique Jan 30 '16 at 00:08
  • There can only be one value associated with any given key. So you can either have one Map (with up to 1000 keys and values) as that value, or one List (with up to 1000 Maps) as that value. I think you will have the memory to spare either way, but my suspicion is that it is more efficient to use one Map (with up to 1000 keys and values) as the value associated with your "date" keys. – geneSummons Feb 01 '16 at 22:38
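The date-parsing and sorting step mentioned in the comments can be sketched like this (a minimal example; the class name and the newest-first ordering are assumptions, the format pattern is inferred from dates like 30-Jun-2015, and since a HashSet cannot be sorted in place its contents are copied into a List first):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.*;

public class SortDates {

    // Matches dates like "30-Jun-2015", assuming English month abbreviations.
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("d-MMM-uuuu", Locale.ENGLISH);

    // Parse the raw CSV date strings and return them newest-first.
    static List<LocalDate> sortedDates(Collection<String> raw) {
        List<LocalDate> dates = new ArrayList<>();   // a Set can't be sorted in place
        for (String s : raw) {
            dates.add(LocalDate.parse(s, FMT));
        }
        dates.sort(Comparator.reverseOrder());
        return dates;
    }
}
```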