Let's say I have N files in a format like this:
One file looks like this:
For each time there is some amount of data, each piece with a different ID:
- time 1:
- data with id: 10
- data with id: 13
- data with id: 4
- time 2:
- data with id: 10
- data with id: 77
...etc
(for each time, the data with IDs 1-1000 are spread somehow (mixed) over these N files)
I would like to combine all these N files into a single file which is ordered:
Final File:
- time 1:
- data with id: 1
- data with id: 2
- data with id: 3
- ...
- data with id: 1000
- time 2:
- data with id: 1
- data with id: 2
- data with id: 3
- ...
- data with id: 1000
...etc
The data for IDs 1-1000 at one time is approximately 100 MB, but I have a lot of times, which adds up to about 50 GB of data.
My solution so far, intended to make this as fast as possible, is the following:
I use T threads on a supercomputer node (one machine with e.g. 24-48 cores). I allocate a shared-memory array that holds the data for all IDs 1-1000 for one time (it could also hold more than one time if I like).
Procedure:
Step 1:
- Each thread opens and owns some of the files. It then copies the data for the IDs found in its files into the shared array.
Step 2:
- When all threads have finished processing one time, thread 1 writes the array, now in ID order, to the final file (see the sketch below).
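Here is a minimal sketch of what I have in mind for steps 1 and 2, assuming each ID's data is a fixed-size binary blob of `BYTES_PER_ID` bytes and that `read_records_for_time()` is a hypothetical placeholder for my actual file parsing:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

constexpr std::size_t NUM_IDS      = 1000;
constexpr std::size_t BYTES_PER_ID = 100 * 1024;   // assumption: ~100 MB / 1000 IDs

struct Record { std::size_t id; std::vector<char> payload; };

// Hypothetical placeholder: returns all records that this thread's own
// files contain for time step `t` (the real parsing is format-specific).
std::vector<Record> read_records_for_time(int thread_id, std::size_t t);

void merge_one_time_step(std::size_t t, int num_threads, std::FILE* out)
{
    // Shared buffer that holds IDs 1..1000 for this time step, in order.
    std::vector<char> shared(NUM_IDS * BYTES_PER_ID);

    std::vector<std::thread> workers;
    for (int tid = 0; tid < num_threads; ++tid) {
        workers.emplace_back([&shared, tid, t] {
            // Step 1: each thread reads its own files and copies every
            // record into the slot its ID dictates; the slots are
            // disjoint, so no locking is needed.
            for (const Record& r : read_records_for_time(tid, t))
                std::copy(r.payload.begin(), r.payload.end(),
                          shared.begin() + (r.id - 1) * BYTES_PER_ID);
        });
    }
    for (auto& w : workers) w.join();   // all threads are done with this time

    // Step 2: a single writer appends the ordered buffer to the final file.
    std::fwrite(shared.data(), 1, shared.size(), out);
}
```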
- I would be very interested to know whether this is efficient. Isn't the parallel read serialized by the storage anyway, making it useless? I could produce the final file either on a local machine with an ultra-fast SSD or on a cluster node with network storage (Lustre or Panasas file systems).
- Could I also use all threads in step 2 to write to disk in parallel, say with MPI-IO (which supports parallel writes at offsets)? Or how else could that be achieved, e.g. with the C++ standard library? (See the sketch below for the MPI-IO idea.)
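For the MPI-IO variant I am picturing something like the following: every rank writes its own slice of the ordered per-time buffer at a computed offset, so the regions never overlap. This is only a sketch under the same assumptions as above (fixed-size records, `NUM_IDS`/`BYTES_PER_ID` as before) plus the assumption that the 1000 IDs divide evenly across the ranks:

```cpp
#include <mpi.h>
#include <cstddef>
#include <vector>

constexpr std::size_t NUM_IDS        = 1000;
constexpr std::size_t BYTES_PER_ID   = 100 * 1024;
constexpr MPI_Offset  BYTES_PER_TIME =
    static_cast<MPI_Offset>(NUM_IDS) * BYTES_PER_ID;

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "final.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Assumption: each rank owns a contiguous, equal-sized block of IDs
    // (adjust for the remainder if NUM_IDS % size != 0).
    const std::size_t ids_per_rank = NUM_IDS / size;
    const MPI_Offset  my_bytes     =
        static_cast<MPI_Offset>(ids_per_rank) * BYTES_PER_ID;
    const MPI_Offset  my_offset_in_time = rank * my_bytes;

    std::vector<char> my_slice(static_cast<std::size_t>(my_bytes));

    const std::size_t num_times = 3;  // placeholder: however many times you have
    for (std::size_t t = 0; t < num_times; ++t) {
        // ... fill my_slice with the data for this rank's IDs at time t,
        //     already sorted by ID ...

        // Collective write: all ranks write their disjoint regions at once.
        MPI_Offset off =
            static_cast<MPI_Offset>(t) * BYTES_PER_TIME + my_offset_in_time;
        MPI_File_write_at_all(fh, off, my_slice.data(),
                              static_cast<int>(my_slice.size()),
                              MPI_BYTE, MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```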
Thanks for any input!