
I am developing a text analysis program that represents documents as arrays of "feature counts" (e.g., occurrences of a particular token) within some pre-defined feature space. These arrays are stored in an ArrayList after some processing.

I am testing the program on a 64 MB dataset with 50,000 records. The program worked fine with small data sets, but now it consistently throws an "out of memory" Java heap exception when I start loading the arrays into an ArrayList object (using the .add(double[]) method). Depending on how much memory I allocate to the heap, I get this exception somewhere between the 1,000th and 3,000th addition to the ArrayList, far short of my 50,000 entries. It became clear to me that I cannot store all this data in RAM and operate on it as usual.

However, I'm not sure which data structures are best suited to let me access and perform calculations on the entire dataset when only part of it can be loaded into RAM.

I was thinking that serializing the data to disk and storing the locations (file offsets) in a HashMap in RAM would be useful. However, I have also seen discussions of caching and buffered processing.

I'm 100% sure this is a common CS problem, so I'm sure there are several clever ways that this has been addressed. Any pointers would be appreciated :-)

  • Use a database which would either allow you to query a small subset or perform some other analysis (like running functions or joins), and which supports cursors (so it's not loading the entire set into memory), or some other memory-mapped cache – MadProgrammer Sep 29 '15 at 03:37
  • @MadProgrammer thanks...what about memory mapped IO files? –  Sep 29 '15 at 03:39
  • Not having used memory mapped files (directly) before, it's difficult to say. My basic concern would be if the data would need to be `Serializable` – MadProgrammer Sep 29 '15 at 03:45
  • "I start loading the arrays into an ArrayList object" Are you trying to get a List, or a List? If latter, how are you going ot be saved by sparse arrays? You're not. – alamar Sep 29 '15 at 03:49
  • @alamar no, I meant List so the fewer elements I need to store, the better. –  Sep 29 '15 at 03:50
  • I can't imagine why you would see OutOfMemoryError when adding an array to a list of arrays. How many arrays have you got there? – alamar Sep 29 '15 at 05:11
  • @alamar 50,000 objects, each holding a 151,000-element array of doubles –  Sep 29 '15 at 11:43

2 Answers


You have plenty of choices:

  • Increase heap size (-Xmx) to several gigabytes.
  • Do not use collections of boxed primitives; use fastutil (http://fastutil.di.unimi.it/) - that should decrease your memory use about 4x.
  • Process your data in batches or sequentially - do not keep the whole dataset in memory at once (see the sketch after this list).
  • Use a proper database. There are even in-process databases like HSQL; your mileage may vary.
  • Process your data via map-reduce, perhaps something local like Pig.
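
To put numbers on the problem: with the figures from the comments (50,000 records × 151,000 doubles × 8 bytes each), a dense in-memory representation is roughly 60 GB, so no realistic `-Xmx` will hold it all and the first option alone won't save you. Below is a minimal sketch of the batch/sequential option, assuming a hypothetical `features.txt` with one whitespace-separated record per line and assuming only a running per-feature aggregate is needed; it illustrates the idea rather than being a drop-in implementation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Streams records one at a time and keeps only a running aggregate
// (per-feature sums), so memory use is O(featureCount), not O(records).
public class StreamingFeatureSums {

    public static void main(String[] args) throws IOException {
        final int featureCount = 151_000;   // figure taken from the comments above
        double[] sums = new double[featureCount];
        long records = 0;

        // Hypothetical input: one record per line, feature counts separated by whitespace.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("features.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                String[] tokens = line.split("\\s+");
                for (int i = 0; i < tokens.length && i < featureCount; i++) {
                    sums[i] += Double.parseDouble(tokens[i]);
                }
                records++;                  // the parsed row becomes garbage after this iteration
            }
        }

        System.out.printf("Processed %d records; mean of feature 0 = %.4f%n",
                records, records == 0 ? 0.0 : sums[0] / records);
    }
}
```

If you need random access to individual records rather than a single pass, the serialize-to-disk-plus-offset-index idea from the question (or a memory-mapped file, or one of the databases above) is a better fit than pure streaming.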
alamar
  • Thanks! I tried increasing heap size... not going to work. I am considering using sparse arrays to decrease storage and I'll look into fastutil. Option three was my next choice. I like the idea of an in-process DB... will consider that too. MapReduce may be overkill, but it's an option too. –  Sep 29 '15 at 03:45

How about using Apache Spark (great for in-memory cluster computing)? This would help scale your infrastructure as your data set gets larger.
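
For what that might look like, here is a minimal local-mode sketch using Spark's Java API, assuming the same hypothetical one-record-per-line `features.txt` layout as in the other answer (the file name, parsing, and fixed record width are placeholders, not part of the original question):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Parses each line into a double[] lazily and reduces to per-feature sums,
// so the full dataset is never materialized on a single heap.
public class SparkFeatureSums {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("feature-sums")
                .setMaster("local[*]");     // runs on all local cores; point at a cluster later

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<double[]> rows = sc.textFile("features.txt")
                    .map(line -> {
                        String[] tokens = line.split("\\s+");
                        double[] row = new double[tokens.length];
                        for (int i = 0; i < tokens.length; i++) {
                            row[i] = Double.parseDouble(tokens[i]);
                        }
                        return row;
                    });

            // Pairwise addition; assumes every record has the same number of features.
            double[] sums = rows.reduce((a, b) -> {
                double[] out = new double[a.length];
                for (int i = 0; i < a.length; i++) {
                    out[i] = a[i] + b[i];
                }
                return out;
            });

            System.out.println("Sum of feature 0 = " + sums[0]);
        }
    }
}
```

The same program scales from a laptop (`local[*]`) to a cluster just by changing the master URL, which is the main argument for reaching for Spark here.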

Clyde D'Cruz
  • Hmmm...would this be overkill for a dataset that will likely never get to 10 TB? –  Sep 29 '15 at 03:51