I am using MPI and h5py/HDF5 (HDF5 and h5py were compiled with parallel support, and everything runs on Python 3.4) on a cluster to stitch a dataset of overlapping tiles (200 or more images/numpy arrays of 2048x2048).
Each tile has an assigned index number that corresponds to the position where it should be written, and all the indices are stored in an array:
Example tile_array:

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]], dtype=int32)
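Each index maps to a fixed target window in the final stitched dataset, roughly like this (the tile size, overlap and 4x4 grid shape below are illustrative numbers, not my real parameters):

import numpy as np

tile_size = 2048            # nominal tile edge in pixels
overlap = 128               # hypothetical overlap between adjacent tiles
step = tile_size - overlap  # distance between the origins of adjacent tiles
grid_shape = (4, 4)         # shape of tile_array

def tile_target_slices(tile_ind):
    # Convert the linear tile index into (row, col) coordinates in the grid
    r, c = np.unravel_index(tile_ind, grid_shape)
    # Top-left corner of the tile inside the stitched dataset
    y0, x0 = r * step, c * step
    return slice(y0, y0 + tile_size), slice(x0, x0 + tile_size)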
The issues I have are:
When running the code on the cluster:
- When I use multiple cores to write the stitched image in parallel, I get randomly missing tiles.
- When I run the stitching serially on rank==0, I still have missing tiles.
- If the dataset is small there are no issues.
- If I do not activate atomic mode for the parallel writing, the number of errors increases. The number of errors also increases if atomic mode is activated when I run the writing on rank==0.
When running the code on a laptop:
- When I run the same code on a single core on a laptop, stripped of the MPI code, no tiles are missing.
Any comments would be greatly appreciated.
Highlights of the procedure:
0) Before stitching, each image of the set is processed and the overlapping edges are blended in order to get a better result in the overlapping regions.
1) I open the HDF5 file where I want to save the data: stitching_file=h5py.File(fpath,'a',driver='mpio',comm=MPI.COMM_WORLD)
2) I create a big dataset that will contain the stitched images.
3) I fill the dataset with zeros (a minimal sketch of steps 1-3 is shown after the writing rounds below).
4) The processed images will be written (added) to the big dataset by different cores. However, adjacent images have overlapping regions. To avoid different cores writing tiles that overlap and corrupting the output, I run four rounds of writing that cover the whole dataset with the following pattern:
Writing rounds:
writing round one: array([ 0, 2, 8, 10], dtype=int32)
writing round two: array([ 1, 3, 9, 11], dtype=int32)
writing round three: array([ 4, 6, 12, 14], dtype=int32)
writing round four: array([ 5, 7, 13, 15], dtype=int32)
If I write on a single rank I don't use this scheme and just write each position sequentially.
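For completeness, here is a minimal sketch of steps 1)-3). The file path, final shape, dtype and the dataset name 'stitched_image' are placeholders, not my actual values:

from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

fpath = 'stitching.h5'              # placeholder path to the output file
final_shape = (4 * 2048, 4 * 2048)  # placeholder size of the final mosaic

# 1) Open the HDF5 file collectively with the MPI-IO driver
stitching_file = h5py.File(fpath, 'a', driver='mpio', comm=comm)

# 2) Create the big dataset. With the mpio driver this call is collective,
#    so every rank must execute it with the same arguments.
stitched_group = stitching_file.require_dataset('stitched_image',
                                                shape=final_shape,
                                                dtype='float64')

# 3) Fill the dataset with zeros, each rank zeroing a disjoint block of rows
my_rows = np.array_split(np.arange(final_shape[0]), size)[rank]
if my_rows.size:
    stitched_group[my_rows[0]:my_rows[-1] + 1, :] = 0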
Pseudocode
# Preprocessing of the images
------------------------------
------------------------------
# MPI data chunking function
def tasks_chunking(tasks, size):
    # This function scatters any kind of list, not only ordered ones.
    # If the number of cores is bigger than the length of the list,
    # some chunks will have length zero and the corresponding cores
    # will receive no tasks.
    # Chunk the list of tasks
    Chunked_list = np.array_split(tasks, size)
    NumberChunks = np.zeros(len(Chunked_list), dtype='int32')
    for idx, chunk in enumerate(Chunked_list):
        NumberChunks[idx] = len(chunk)
    # Calculate the displacement of each chunk in the flattened task list
    Displacement = np.zeros(len(NumberChunks), dtype='int32')
    for idx, count in enumerate(NumberChunks[0:-1]):
        Displacement[idx + 1] = Displacement[idx] + count
    return Chunked_list, NumberChunks, Displacement
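# Example of the chunking: with 16 tile positions and size=5 cores,
# np.array_split gives chunks of length [4, 3, 3, 3, 3], so
# NumberChunks == [4, 3, 3, 3, 3] and Displacement == [0, 4, 7, 10, 13].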
# Flush the file after the preprocessing
stitching_file.flush()
# Writing dictionary. Each group will contain a subset of the tiles to be written.
# Data has the size of the final array where the data will be saved.
fake_tile_array = np.arange(Data.size, dtype='int32')
fake_tile_array = fake_tile_array.reshape(Data.shape)
writing_dict = {}
writing_dict['even_group1'] = fake_tile_array[::2, ::2].flat[:]
writing_dict['even_group2'] = fake_tile_array[::2, 1::2].flat[:]
writing_dict['odd_group1']  = fake_tile_array[1::2, ::2].flat[:]
writing_dict['odd_group2']  = fake_tile_array[1::2, 1::2].flat[:]
# Without activating the atomic mode the number of errors in parallel mode is higher.
stitching_file.atomic = True
# Loop through the dictionary items to start the writing
for key, item in writing_dict.items():
    # Chunk the positions that need to be written
    if rank == 0:
        Chunked_list, NumberChunks, Displacement = tasks_chunking(item, size)
    else:
        NumberChunks = None
        Displacement = None
        Chunked_list = None
    # Make each core aware of how many positions it will receive.
    # cnt is filled by the Scatter call and is different on every core.
    cnt = np.zeros(1, dtype='int32')
    comm.Scatter(NumberChunks, cnt, root=0)
    # Define the local buffer that will be filled with the scattered data;
    # its size on each core is given by cnt
    xlocal = np.zeros(cnt, dtype='int32')
    # Scatter the tile positions to the different cores
    comm.Scatterv([item, NumberChunks, Displacement, MPI.INT], xlocal, root=0)
    for tile_ind in xlocal:
        # This writing function is the same one called when I run the writing
        # on rank 0 or on a single core on a laptop.
        paste_in_final_image(joining, temp_file, stitched_group, tile_ind, nr_pixels)
    comm.Barrier()
stitching_file.flush()
stitching_file.close()