
I am trying to portion out 1 million lines of float numbers to 16 different processes. For example, process 0 needs to read between lines 1-62500 and process 1 needs to read between lines 62501-125000 etc.

I have tried the following code, but every process reads the lines between 1-62500. How can I change the line interval for each process?

MPI_Init(NULL, NULL);

n=1000000/numberOfProcesses;

FILE *myFile;

myFile = fopen("input.txt","r");
i=0;
k = n+1;
while(k--){
    fscanf(myFile,"%f",&input[i]);
    i++;
}
fclose(myFile);

MPI_Finalize();
  • Does this answer your question? [MPI Reading from a text file](https://stackoverflow.com/questions/12939279/mpi-reading-from-a-text-file) – Eraklon Mar 06 '20 at 11:21
  • @Eraklon Each process must process the range of lines that concerns it. This is not reflected in your program: your processes cannot guess their range of lines. – Landstalker Mar 06 '20 at 11:50
  • Are all lines the same length? If not, the only way to find where the lines are is to read the entire file until you reach the lines you need. – Andrew Henle Mar 06 '20 at 11:58
  • No, line length varies between 3 and 4 depending on whether the number is positive or negative. In other words, every line contains a number in the format -x.x or x.x @AndrewHenle – Deniz Ünal Mar 06 '20 at 12:31

1 Answer


Assuming numberOfProcesses=4 and numberOfLines=16:

// so the new n will be 4
// n = 1000000/numberOfProcesses;
n = numberOfLines/numberOfProcesses;
FILE *myFile;

myFile = fopen("input.txt","r");
i = 0;
k = n+1;  // 5 in this example

In your program, all processes read the file from the same location (offset). What you need is for each process to start from its own line: rank 0 from line 0, rank 1 from line n, rank 2 from line 2*n, and so on. Convert that starting line into a byte offset and pass it to fseek.

n = numberOfLines/numberOfProcesses;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
file_start = n*rank*lineLength;  // byte offset; assumes every line is lineLength bytes
fseek(myFile, file_start, SEEK_SET);

fseek moves the file position to the byte offset file_start. With n = 4, the starting line is 0 for rank 0, 4 for rank 1, 8 for rank 2, and so on. Note that fseek counts bytes, not lines, so this only works directly if every line has the same fixed length.

The while loop should then read exactly n lines from that position.
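Since the lines in this question are not fixed-length (3 or 4 characters each), a byte offset cannot be computed directly. One portable, if inefficient, sketch is to have each rank skip the values owned by lower ranks with fscanf before reading its own chunk. The helper name read_my_chunk is illustrative, not from the original code:

```c
#include <stdio.h>

/* Sketch: each rank skips rank*n values, then reads its own n floats.
 * Returns the number of values actually read, or -1 on error. */
static int read_my_chunk(const char *path, int rank, int n, float *out) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;

    float dummy;
    for (int i = 0; i < rank * n; i++)  /* skip lines owned by lower ranks */
        if (fscanf(f, "%f", &dummy) != 1) { fclose(f); return -1; }

    int got = 0;
    while (got < n && fscanf(f, "%f", &out[got]) == 1)
        got++;

    fclose(f);
    return got;
}
```

Every rank still scans the file from the beginning, so the total work is O(ranks × file size); that is exactly the inefficiency pointed out in the comments below.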

As @Gilles pointed out in the comments, this assumes the number of lines in the file is known in advance (and, for the fseek approach, that every line has the same length). This can lead to many issues.

To get scalability and parallel performance benefits, it is better to use MPI-IO, which offers rich features for parallel file operations. MPI-IO was designed for exactly this kind of use case.

j23
  • You are implicitly assuming the number of lines is known before opening the file. Also, if the line size is not constant, all the ranks will end up reading the same lines which is inefficient. As you suggested, better use binary format and MPI-IO. – Gilles Gouaillardet Mar 06 '20 at 12:21
  • You are right, the k value for your specific example will be 5 for each processor. It should be the same for each processor since they need to read the same number of lines from the text file. But in your code, the first processor reads the first 4 lines and the second reads the first 8. This is not what I meant: the first processor should read the first 4 lines and the second should read lines 5 to 8. Changing the value of k only changes how many lines are read from the beginning. – Deniz Ünal Mar 06 '20 at 12:28
  • @DenizÜnal for that you should use fseek() and sets the start of the file pointer to the desired offset. Then read k lines from that position. See my updated answer. For this usecase MPI IO is better. – j23 Mar 06 '20 at 13:48
  • The line size is not fixed, and there is no way to directly fseek() to a given line. An option to add a bit of parallelism is to read lines in the master, scatter offsets (in characters) and have each task re-read the lines and sscanf() them (parsing a line is faster than parsing a few floats). – Gilles Gouaillardet Mar 07 '20 at 02:34
  • Another option would be to have each task parse a portion of the file, skip the first char, parse data and read more data to correctly handle incomplete lines, and figure out which is the number of the first line they parsed. In any case, binary format and MPI-IO or a higher abstraction (hdf5, netcdf or adios) is a much better fit if they can be used. – Gilles Gouaillardet Mar 07 '20 at 02:38