Does anyone know a good way to read and write files on HDFS from within an MPI program? I've done a fair amount of digging trying to figure this out, and just need a general direction to pursue.
2 Answers
The MPI Standard has a full chapter on MPI I/O. I'd start by reading that.
Most MPI implementations provide MPI I/O, usually through ROMIO, so you can also take a look at that.

HDFS has some oddities that make it an interesting target for MPI-IO. Foremost among them is the restriction on modifications (writes) to a file from more than one process.
It looks like the PLFS project (which takes MPI-IO style "all write to one file" workloads and changes them to "one file per process" workloads) has made HDFS one of its targets. This paper (with a whopping two citations) appears to be the reference: http://www.pdl.cmu.edu/PDL-FTP/HECStorage/CMU-PDL-12-115.pdf
So you'd have the MPI-IO interface, implemented by ROMIO. ROMIO has a device abstraction layer called ADIO, and PLFS can be one of those underlying devices (if you patch it). Then PLFS speaks HDFS and you finally perform I/O.
I have no idea how performant this stack is!
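To make the "one file per process" idea concrete, here is a minimal sketch in plain Python. It is not PLFS code and uses neither MPI nor HDFS: a local temporary directory stands in for the container, the "ranks" are simulated in a loop, and the file names `data.<rank>` and `index.<rank>` are made up for illustration. The point is only that each writer appends to its own log (which fits the one-writer-per-file restriction) and records where each chunk belongs in the logical shared file.

```python
import os
import tempfile


def rank_write(container, rank, logical_offset, data):
    """Append data to this rank's private log and note its logical position."""
    # Hypothetical layout: one data log and one index file per rank.
    data_path = os.path.join(container, f"data.{rank}")
    index_path = os.path.join(container, f"index.{rank}")
    # The chunk lands at the current end of the private log.
    physical_offset = os.path.getsize(data_path) if os.path.exists(data_path) else 0
    with open(data_path, "ab") as log:
        log.write(data)
    with open(index_path, "a") as index:
        index.write(f"{logical_offset} {physical_offset} {len(data)}\n")


def read_logical(container):
    """Rebuild the logical shared file by merging every rank's index."""
    extents = []
    for name in sorted(os.listdir(container)):
        if not name.startswith("index."):
            continue
        rank = name.split(".", 1)[1]
        with open(os.path.join(container, name)) as index:
            for line in index:
                logical, physical, length = map(int, line.split())
                extents.append((logical, rank, physical, length))
    size = max((lo + ln for lo, _, _, ln in extents), default=0)
    out = bytearray(size)
    # Assumption: writes do not overlap, so the merge order does not matter.
    for logical, rank, physical, length in extents:
        with open(os.path.join(container, f"data.{rank}"), "rb") as log:
            log.seek(physical)
            out[logical:logical + length] = log.read(length)
    return bytes(out)


def demo():
    """Three simulated ranks write interleaved 4-byte blocks to one logical file."""
    nranks, block = 3, 4
    with tempfile.TemporaryDirectory() as container:
        for step in range(2):
            for rank in range(nranks):
                offset = (step * nranks + rank) * block
                rank_write(container, rank, offset, bytes([ord("A") + rank]) * block)
        return read_logical(container)


if __name__ == "__main__":
    print(demo())
```

In the real stack the application would keep calling the MPI-IO interface, and this kind of remapping would happen underneath it.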
