Is it possible to write data to a single output file from multiple processors? I mean, suppose each processor holds a part of the data (e.g. of a matrix), and the whole matrix should be written to a single output file. Is it possible for each processor to write its own part in parallel (at the same time, not one after another)?
-
This refers... http://wgropp.cs.illinois.edu/courses/cs598-s16/lectures/lecture32.pdf – Mark Setchell Sep 01 '19 at 09:32
2 Answers
Yes it is absolutely possible, and MPI gives you all the tools to do so.
A great introduction to MPI I/O was already linked in the comments. I'm just going to show a minimal example to demonstrate it:
#include <stdint.h>
#include <mpi.h>
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>

// Use a 64-bit constant: 1024 * 1024 * 256 * 12 elements would overflow a plain int.
const int64_t N = 1024ll * 1024 * 256 * 12;
const MPI_Datatype MPI_T = MPI_UINT64_T;
typedef uint64_t T;
const char* filename = "mpi.out";

int main() {
    MPI_Init(NULL, NULL);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    assert(N % size == 0);

    // Each rank fills only its own contiguous chunk of the global array.
    T* my_part = calloc(N / size, sizeof(T));
    for (size_t i = 0; i < N / size; i++)
        my_part[i] = i + rank * (N / size);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, filename,
                  MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL,
                  &fh);
    // The file view starts at each rank's own byte offset within the shared file.
    MPI_File_set_view(fh, rank * (N / size) * sizeof(T), MPI_T, MPI_T,
                      "native", MPI_INFO_NULL);

    MPI_Barrier(MPI_COMM_WORLD);
    double begin = MPI_Wtime();
    // Collective write: all ranks write their parts at the same time.
    MPI_File_write_all(fh, my_part, N / size, MPI_T, MPI_STATUS_IGNORE);
    MPI_Barrier(MPI_COMM_WORLD);
    double duration = MPI_Wtime() - begin;

    if (rank == 0)
        printf("Wrote %llu B in %f s, %f GiB/s\n",
               (unsigned long long)(N * sizeof(T)), duration,
               N * sizeof(T) / (duration * 1024 * 1024 * 1024));

    MPI_File_close(&fh);
    MPI_Finalize();
}
With very minimal tuning (setting the striping size to 12) and 12 ranks, this achieves quite reasonable performance of ~7.3 GiB/s on Lustre. Note that this exceeds the raw throughput of the 4x FDR InfiniBand link used by the system.
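For reference, such striping settings can also be requested from within the program via MPI-I/O info hints. A small sketch (the helper name open_striped is made up here; "striping_factor" and "striping_unit" are reserved MPI info keys, but implementations and file systems are free to ignore them, and on Lustre you can achieve the same with lfs setstripe on the output directory):
#include <stdio.h>
#include <mpi.h>

/* Sketch of a helper ( name made up ) that requests striping via
 * MPI-I/O info hints at open time. The hints are only a request:
 * the MPI implementation / file system may ignore them. */
static MPI_File open_striped(const char* fname, int stripe_count, long stripe_size)
{
    MPI_Info info;
    MPI_Info_create(&info);

    char buf[32];
    snprintf(buf, sizeof buf, "%d", stripe_count);
    MPI_Info_set(info, "striping_factor", buf);   /* number of stripes / OSTs */
    snprintf(buf, sizeof buf, "%ld", stripe_size);
    MPI_Info_set(info, "striping_unit", buf);     /* stripe size in bytes */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, fname,
                  MPI_MODE_WRONLY | MPI_MODE_CREATE, info, &fh);
    MPI_Info_free(&info);
    return fh;
}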
Typically you would use custom data types with such file views, or even higher-level I/O libraries such as HDF5 that work on top of MPI I/O, as sketched below. Getting optimal performance will probably require some site-specific tuning.
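For instance, for the matrix case from the question, a subarray datatype lets the file view describe each rank's block of the global matrix directly, so no manual offset arithmetic is needed. A minimal sketch with a made-up file name and dimensions, assuming a block-row distribution where the row count divides evenly by the number of ranks:
#include <mpi.h>
#include <assert.h>
#include <stdlib.h>

#define ROWS 1024   /* made-up global matrix dimensions */
#define COLS 1024

int main(void) {
    MPI_Init(NULL, NULL);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    assert(ROWS % size == 0);

    int local_rows = ROWS / size;      /* block of rows owned by this rank */
    double* local = malloc((size_t)local_rows * COLS * sizeof(double));
    for (int i = 0; i < local_rows * COLS; i++)
        local[i] = rank;               /* dummy data */

    /* Describe where this rank's block sits inside the global matrix. */
    int sizes[2]    = { ROWS, COLS };
    int subsizes[2] = { local_rows, COLS };
    int starts[2]   = { rank * local_rows, 0 };
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "matrix.out",
                  MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
    /* The file view maps each rank's write onto its own block of the matrix. */
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, local, local_rows * COLS, MPI_DOUBLE,
                       MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(local);
    MPI_Finalize();
}
With a column-block or 2D-block decomposition only sizes, subsizes and starts change; the collective write itself stays the same.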
In practice, knowing this much is sufficient. However, referring to the discussion in the other answer by user3666197, a few more nitpicky details:
This code snippet expresses concurrent file writes at a rather high level of abstraction. You get truly parallel I/O by executing this code on an HPC system with a parallel file system. It is absolutely possible that, in a well-tuned configuration, the bytes written by the different ranks follow entirely different paths and end up on different disks on different storage servers - all in parallel. What matters in your code is to express the concurrency - then you apply tuning to make sure it performs well, which means allowing the storage system to execute it efficiently and in parallel.

Q: Is it possible for each processor to write its own part in parallel (at the same time, not one after another)?
No, it is not. Writing is a process of putting atomic-pieces-of information ( bits to tape, characters to a file-abstraction ) in a pure-[SERIAL] fashion.
If in doubt, take and hold 5 pencils in your hand ( I cannot do it, so it is fine to just imagine that one can ) and try to write one word on a paper; due to the "process"-of-writing related circumstances ( a singularity of how we write ), it is impossible to "write" 5 independent, i.e. different, words in this simplified example.
Similarly, in another form of illustration - if you have a typewriter machine ( hope it is not too archaic an image ) - one can get 5 copies of the same pure-[SERIAL] sequence of characters ( thanks to the 4 pieces of carbon-copy paper inserted between those 5 sheets of office paper ), yet none of these copies will differ from the original - so they are not independent ( as they would be in true-[PARALLEL] processes ) but just a set of replicas, which is a productive use of time and resources when producing some paperwork for sending 1 original + 4 copies into some administrative Matrix, but not an example of true-[PARALLEL] writes.
Last but not least, any attempt to use more than one finger at once for typing on a typewriter ( which puts a pure-[SERIAL] sequence of characters onto paper ) will produce a mechanical jam, as the process of mechanical type-writing relies on a singular point where a character is printed, through a hit onto an ink-ribbon, onto the paper.
Modern filesystems are far from this trivial archetype, yet they keep a similar concept of producing and maintaining a pure-[SERIAL] representation of a sequence of characters. Even though it is possible to open more filehandles with access "into" this sequence of characters, that does not mean one has a chance to make the file-I/O operations de-serialised, much less happen at once ( as disk heads are not present at several different locations of the magnetic-disk storage at once ( even less so for a tape device ), and neither do almost-random-access devices, like SSDs et al., go this wild way, where they would lose control of their low-level properties ( wear-levelling, elevator-optimisations, power-limiting and similar low-level device tricks ) ).

-
Your analogy is irrelevant because storage solutions, fortunately, don't have the same limitations that typewriters/fingers/single disks suffer from. Even if we were to follow the analogy, you have 5 writers writing 5 chapters (matrix rows) on 5 typewriters (storage targets of a parallel file system) - in parallel - which are then glued together into a book (file). There you go! – Zulan Sep 11 '19 at 13:51
-
@Zulan All modern F/S try to minimise the blind spot when a content-modifying write has to occur. Some tricks hide it smarter, some less smart. Any such modification *cannot occur in parallel* ~ **Dilemma: which one of the many parallel values to write is to be stored?** Diving deeper into the costs of maintaining cache-coherency across the cluster-wide distributed Lustre/PFS, the actual (hidden from users' sight) ordering of changes of the meta-data, of the content, of the locks and of the cache-coherency is never parallel at the lowest level. **Concurrent?** Yes, but **not parallel** – user3666197 Sep 11 '19 at 14:56
-
The question is not about metadata. It is about data. Specifically, about non-overlapping areas in a single file. For which this scheme applies: https://www.nics.tennessee.edu/files/images/striped-view-A.jpg - Or you could just read the [already suggested lecture](http://wgropp.cs.illinois.edu/courses/cs598-s16/lectures/lecture32.pdf). If you continue to disagree with Prof. Gropp, I suggest you back your claims with credible sources or an argument that doesn't rely on analogies. – Zulan Sep 11 '19 at 15:15
-
Thanks for reminding one of the technology-layer trick ( the striping ). With all respect to all opponents, striping is not a solution, it is a mere strategy ( since its first implementations deep back in the 90s ) to lower the rate of colliding access-requests, to further enjoy the stochastic chance of lowering access-latency ( if complemented with 1:N replicas, some of which might be faster to access than the "local" write ) and to harness distributed resources, thus avoiding "local" bottlenecks in the flow of data to/from the device. Yet, striping has nothing to do with a true-[PARALLEL] write – user3666197 Sep 11 '19 at 15:21
-
Striping permits marshalling a write to take place on a "distributed" resource, yet if all 5 ( claimed-to-be-true-[PARALLEL] ) writers are instructed to put a letter ( a data-content element ) onto page 136, 3rd row, last-but-one character printed on that row, the claim proves itself to be false. These 5 men cannot obey, in a true-[PARALLEL] fashion, such an instruction from "their" own centre of command ( the true-[PARALLEL]-process 1 asking to type and put "there" a letter A, the true-[PARALLEL]-process 2 asking to type a letter B, the true-[PARALLEL]-process 3 asking to type a letter Z ) – user3666197 Sep 11 '19 at 15:29
-
Concurrent access to the same position in a file is not a requirement of the question. – Zulan Sep 11 '19 at 15:33
-
@Zulan Summed up, striping is not the solution, the "collision avoidance" was. So it is not a property of the claimed F/S, but an a priori given warranty ( obtained from a promise ( for which no one says what will happen if the said promise does not get fulfilled under some future state or conditions in the run-time ecosystem ) ) that permits the said striping to "survive" that specifically orchestrated use-case, but **that does not generalise into a solution for true-[PARALLEL] writes.** So, kindly do not try to exchange an effect with a cause in the chain of argumentation. – user3666197 Sep 11 '19 at 15:34