At the moment I am working with large float/double datasets used for calculations. I have a set of files in which I need to compare Data A with Data B by computing the Euclidean distance or cosine similarity, i.e. each point in Data A iterates through the points in Data B to find its nearest neighbour.

The data is given in a text file - no issues with that. What would be an ideal way to go about storing/reading the information?

I would have to iterate over Data B again for every point in Data A. The data is to be stored as floats. Each data point may have multiple dimensions. A file may contain up to about 2 million floats.

Should I go about this by:

  1. Constantly reading Data B's file and parsing the string (I feel that this is highly inefficient)
  2. Storing the data in a List (An array of floats)
  3. Using memory-mapped I/O?
  4. Using a HashMap (I am relatively new to HashMap. I have read that the positions of the elements in the collection may change over time; if I am only iterating with no modifications, will the positions change?)
Matt
natchan
  • I don't understand why a simple `float[][]` array doesn't work here. – Louis Wasserman Feb 16 '12 at 08:21
  • You seem to be better at math than I am, so just try to estimate the required memory if you store the floats in an array: a float is 4 bytes, and you have 2 million of them. That makes 8 million bytes: 8 MB. Peanuts to store in memory. Even if the data structure is more memory-hungry and you multiply the memory needed per float by 10, it still comes to only 80 MB. Still peanuts. – JB Nizet Feb 16 '12 at 08:26
  • Oh, I forgot to add that points in a data set may be missing, making the set incomplete. So I either have to 1) scan through the file to find the maximum dimensions and classes, or 2) use lists. Which do you think would have less overhead: scanning through the file once before creating a fully sized 2D array, or using a list? – natchan Feb 16 '12 at 11:16

2 Answers

The basic solution is the best one: just a `float[][]`. That's almost certainly the most memory-efficient and the fastest solution, and very simple.
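As a minimal sketch of what that could look like, assuming every point has the same number of dimensions and that a brute-force scan over B is acceptable: the class and method names below (`NearestNeighbour`, `squaredDistance`, `cosineSimilarity`, `nearestIndex`) are made up for illustration and are not from the question.

```java
class NearestNeighbour {

    /** Squared Euclidean distance; the square root is not needed just to rank neighbours. */
    static double squaredDistance(float[] p, float[] q) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double d = (double) p[i] - q[i];
            sum += d * d;
        }
        return sum;
    }

    /** Cosine similarity; returns 0 if either vector has zero length (an arbitrary choice). */
    static double cosineSimilarity(float[] p, float[] q) {
        double dot = 0.0, normP = 0.0, normQ = 0.0;
        for (int i = 0; i < p.length; i++) {
            dot += (double) p[i] * q[i];
            normP += (double) p[i] * p[i];
            normQ += (double) q[i] * q[i];
        }
        if (normP == 0.0 || normQ == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normP * normQ);
    }

    /** Index of the row of b closest to point (brute force), or -1 if b is empty. */
    static int nearestIndex(float[] point, float[][] b) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int j = 0; j < b.length; j++) {
            double d = squaredDistance(point, b[j]);
            if (d < bestDist) {
                bestDist = d;
                best = j;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Tiny made-up data sets, one row per point.
        float[][] a = {{0f, 0f}, {5f, 5f}};
        float[][] b = {{1f, 1f}, {4f, 6f}, {-3f, 0f}};
        for (int i = 0; i < a.length; i++) {
            System.out.println("A[" + i + "] -> B[" + nearestIndex(a[i], b) + "]");
        }
    }
}
```

Both arrays stay in memory for the whole run, so B is parsed only once no matter how many points A contains.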

Louis Wasserman
  • Sorry, I left out additional information that led to the above question; I would appreciate it if you could shed some light on it. – natchan Feb 16 '12 at 11:21
2M floats is not much at all; it will be perfectly fine to put them all in a list. One list for A, one for B. If A and B are multidimensional, `float[][]` is just fine. If you find you are running out of memory, try loading the whole of B first, then reading one data point from A at a time.
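A rough sketch of the loading step, under the assumption (not stated in the question) that each line of the text file holds one point as whitespace-separated numbers. `PointLoader`, `parsePoint` and `loadAll` are hypothetical names. Collecting rows in a list and converting once at the end means the file is read in a single pass, and rows of different lengths are allowed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

class PointLoader {

    /** Parses one line of whitespace-separated numbers into a point. */
    static float[] parsePoint(String line) {
        String[] tokens = line.trim().split("\\s+");
        float[] point = new float[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            point[i] = Float.parseFloat(tokens[i]);
        }
        return point;
    }

    /** Reads every point in a single pass; blank lines are skipped. */
    static float[][] loadAll(Reader source) throws IOException {
        List<float[]> points = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                points.add(parsePoint(line));
            }
        }
        // Rows may have different lengths, so the result can be a ragged array.
        return points.toArray(new float[0][]);
    }
}
```

For the low-memory variant, B could be loaded once with `loadAll(new FileReader(...))`, while A is read line by line with a `BufferedReader`, calling `parsePoint` on each line and comparing that single point against B before moving on.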

mbatchkarov
  • Sorry, I left out additional information that led to the above question; I would appreciate it if you could shed some light on it. – natchan Feb 16 '12 at 11:21