
I have this dataframe:

date,AA,BB,CC
2018-01-01 00:00:00,45.73,0.0,1
2018-01-01 01:00:00,44.16,0.0,2
2018-01-01 02:00:00,42.24,0.0,3
2018-01-01 03:00:00,39.29,0.0,5
2018-01-01 04:00:00,36.0,0.0,6
2018-01-01 05:00:00,41.99,0.0,7
2018-01-01 06:00:00,42.25,0.0,8

I would like to know if it is possible to read it with the MPI I/O paradigm.

In particular, I would like to divide the rows among the processors. Suppose you have 4 processors: I would like each processor to read two lines; processor 0 reads lines 1 and 2, processor 1 reads lines 3 and 4, and so on.

I have studied some material. As far as I have understood, I have to compute a sort of offset, and the file has to be written as one single line. Another possibility could be to use something related to subgrids.

However, as you can notice, there are different kinds of variables in each line.

Could someone give me a clue? What I have found so far about MPI I/O is very theoretical, with no practical examples.

Thanks, Diego

diedro

1 Answer


MPI-IO works great for binary data. It is less well suited for text data.

If this were binary data, I would expect a header and an index. Rank 0 could read that header and index, broadcast to everyone where the data resides, and then some algorithmic decomposition of records could happen (e.g., each rank reads N records).

For an ASCII file like this, you are right to ask: how do you split up the file?

How big are these files? If they are several megabytes (so not that large), read the data on rank 0 and distribute from there.

Another approach might be to generate an index, either as part of the dataframe or as a separate binary index. That index would map records to file offsets, and then you can split up the job of reading across all the processes.
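To illustrate the index idea, here is a minimal sketch (plain Python, no MPI calls; the `rank` and `nprocs` values stand in for what `MPI_Comm_rank`/`MPI_Comm_size` would give you) that builds a record-to-byte-offset index for a text file and computes which contiguous slice of records each rank would read:

```python
import io

def build_index(f):
    """Map each record (line) to its starting byte offset in the file."""
    offsets = []
    pos = 0
    for line in f:
        offsets.append(pos)
        pos += len(line)
    return offsets

def my_records(offsets, rank, nprocs):
    """Block decomposition: each rank owns a contiguous run of records."""
    n = len(offsets)
    per_rank = (n + nprocs - 1) // nprocs   # ceiling division
    lo = rank * per_rank
    hi = min(lo + per_rank, n)
    return offsets[lo:hi]

data = ("date,AA,BB,CC\n"
        "2018-01-01 00:00:00,45.73,0.0,1\n"
        "2018-01-01 01:00:00,44.16,0.0,2\n"
        "2018-01-01 02:00:00,42.24,0.0,3\n"
        "2018-01-01 03:00:00,39.29,0.0,5\n")

offsets = build_index(io.StringIO(data))
# Each rank would then read its lines at these offsets,
# e.g. with MPI_File_read_at.
print(my_records(offsets, 0, 2))
print(my_records(offsets, 1, 2))
```

Once every rank knows its byte offsets, the actual reads are independent and can be done with explicit-offset MPI-IO calls.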

Rob Latham
  • I am not sure I have understood your answer. My files can be very large, up to 10 GB. Could I convert my file to binary? If so, I could study how to apply your first suggestion. Where can I learn how to apply your last suggestion? As I told you, I am totally new to MPI-IO – diedro Mar 16 '21 at 22:37
  • ok, 10 GiB is big enough that read-and-broadcast isn't going to work. There's just no great way to read text files with MPI-IO. It comes up from time to time: https://stackoverflow.com/questions/41572586/reading-text-files-using-mpi-io https://stackoverflow.com/questions/12939279/mpi-reading-from-a-text-file – Rob Latham Mar 18 '21 at 01:54
  • ok. What about converting it to a binary file and then reading it with MPI-IO? – diedro Mar 18 '21 at 20:50
  • binary can work: you'd have to define a fixed format, maybe 64 bits each for the 'timestamp', 'AA', 'BB', and 'CC' fields. Then your MPI-IO reader knows a "record" is 32 bytes. If you have 100 processes reading this file, and you know the total file size, you can decompose based on rank: `((filesize/record_size)/nprocs)*rank` would get you the starting record for each process – Rob Latham Mar 19 '21 at 14:29
  • Instead of thinking about MPI-IO, however, what about HDF5 or (parallel) NetCDF? You'd still want to think about "binary" instead of "text" but then instead of operating on bits and bytes you'd start thinking higher-level about "rows of an array" – Rob Latham Mar 19 '21 at 14:31
  • I have started to study NetCDF. It seems that NetCDF needs gridded data. I have no grids; I mean, my grid is not a rectangular domain. Moving to NetCDF would imply more complex features to handle an irregular grid. This is the reason that pushes me toward simple text files. What do you think? – diedro Mar 21 '21 at 11:04
  • Definitely the case that lots of NetCDF applications are storing grids, but that is because of the applications, not NetCDF itself. All NetCDF lets you do is store a multi-dimensional array of typed data. You can also annotate that data. – Rob Latham Mar 22 '21 at 14:13
  • Dear Rob Latham, could you please point me to a good tutorial on how to write a NetCDF file? – diedro Mar 23 '21 at 21:50
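Following up on the fixed-format suggestion in the comments above, here is a rough sketch (plain Python with `struct`; the field layout is an assumption: four 8-byte little-endian doubles per record, with the timestamp stored as a Unix epoch) of converting such rows to 32-byte binary records and computing each rank's starting record with the `(filesize/record_size)/nprocs * rank` arithmetic:

```python
import struct
from datetime import datetime, timezone

RECORD_FMT = "<dddd"                        # timestamp, AA, BB, CC
RECORD_SIZE = struct.calcsize(RECORD_FMT)   # 32 bytes per record

def pack_row(row):
    """Convert one CSV row into a fixed-size binary record."""
    date, aa, bb, cc = row.split(",")
    ts = datetime.strptime(date, "%Y-%m-%d %H:%M:%S").replace(
        tzinfo=timezone.utc).timestamp()
    return struct.pack(RECORD_FMT, ts, float(aa), float(bb), float(cc))

def start_record(filesize, nprocs, rank):
    """First record owned by `rank` (the arithmetic from the comment)."""
    return ((filesize // RECORD_SIZE) // nprocs) * rank

rows = ["2018-01-01 00:00:00,45.73,0.0,1",
        "2018-01-01 01:00:00,44.16,0.0,2",
        "2018-01-01 02:00:00,42.24,0.0,3",
        "2018-01-01 03:00:00,39.29,0.0,5"]

blob = b"".join(pack_row(r) for r in rows)
# An MPI-IO reader would seek to start_record(...) * RECORD_SIZE
# (e.g. with MPI_File_read_at) and read its share of whole records.
print(len(blob), RECORD_SIZE, start_record(len(blob), 2, 1))
```

With a fixed record size, every process can compute its own byte offset independently, which is exactly what explicit-offset MPI-IO reads need.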