0

I have a large number (100s of millions) of fixed-size values stored in a random order on a disk. I have the same set of values stored in memory, in a different order. I need to store the values in the order they are in memory, on disk. The challenge is this: I need to keep at least one copy of each value on disk at any one time – i.e. it needs to be durable.

I have quite a bit of RAM to work with (the values take up only about 60%), a lot of ephemeral storage, but only a very small amount of space on the durable disk, enough for less than a million of the values.

Given a value on disk, I can find it in memory very very quickly. But the converse is not true, given a value in memory, it is very slow to find it on disk.

Given these limitations, what's the best algorithm to transfer the order of the values from memory, to disk, as fast as possible?

Max
  • 2,760
  • 1
  • 28
  • 47

1 Answers1

0

Sounds like you have a sorting problem, where your comparator is the order of elements in RAM (element x is 'bigger' than element y, if x appears after y in RAM).

It can be solved using an external sort.

Note that if you allow duplicates, some more processing needs to be done in order to make sure your comparator is valid (can be solved by enumerating the identical values, and assigning a 'dupe_id' to each duplicate - both in RAM and on disk)

amit
  • 175,853
  • 27
  • 231
  • 333