Is it possible to write data to a single output file from multiple processors? I mean, suppose each processor holds a part of the data (e.g. of a matrix), and the whole matrix should be written to a single output file. Is it possible for each processor to write its own part in parallel (at the same time, not one after another)?
-
This refers... http://wgropp.cs.illinois.edu/courses/cs598-s16/lectures/lecture32.pdf – Mark Setchell Sep 01 '19 at 09:32
2 Answers
Yes it is absolutely possible, and MPI gives you all the tools to do so.
A great introduction to MPI I/O was already linked in the comments. I'm just going to show a minimal example to demonstrate it:
#include <stdint.h>
#include <mpi.h>
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>

// Use a 64-bit constant: 1024 * 1024 * 256 * 12 elements would overflow a plain int.
const int64_t N = 1024ll * 1024 * 256 * 12;
const MPI_Datatype MPI_T = MPI_UINT64_T;
typedef uint64_t T;
const char* filename = "mpi.out";

int main() {
    MPI_Init(NULL, NULL);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    assert(N % size == 0);

    // Each rank fills only its own contiguous chunk of the global array.
    T* my_part = calloc(N / size, sizeof(T));
    for (size_t i = 0; i < N / size; i++)
        my_part[i] = i + rank * (N / size);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, filename,
                  MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL,
                  &fh);
    // The file view starts at each rank's own byte offset within the shared file.
    MPI_File_set_view(fh, rank * (N / size) * sizeof(T), MPI_T, MPI_T,
                      "native", MPI_INFO_NULL);

    MPI_Barrier(MPI_COMM_WORLD);
    double begin = MPI_Wtime();
    // Collective write: all ranks write their parts at the same time.
    MPI_File_write_all(fh, my_part, N / size, MPI_T, MPI_STATUS_IGNORE);
    MPI_Barrier(MPI_COMM_WORLD);
    double duration = MPI_Wtime() - begin;

    if (rank == 0)
        printf("Wrote %llu B in %f s, %f GiB/s\n",
               (unsigned long long)(N * sizeof(T)), duration,
               N * sizeof(T) / (duration * 1024 * 1024 * 1024));

    MPI_File_close(&fh);
    MPI_Finalize();
}
With very minimal tuning (setting the striping size to 12) and 12 ranks, this achieves quite reasonable performance of ~7.3 GiB/s on Lustre. Note that this exceeds the raw throughput of the 4x FDR InfiniBand link used by the system.
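For reference, such striping settings can also be requested from within the program via MPI-I/O info hints. A small sketch (the helper name open_striped is made up here; "striping_factor" and "striping_unit" are reserved MPI info keys, but implementations and file systems are free to ignore them, and on Lustre you can achieve the same with lfs setstripe on the output directory):
#include <stdio.h>
#include <mpi.h>

/* Sketch of a helper ( name made up ) that requests striping via
 * MPI-I/O info hints at open time. The hints are only a request:
 * the MPI implementation / file system may ignore them. */
static MPI_File open_striped(const char* fname, int stripe_count, long stripe_size)
{
    MPI_Info info;
    MPI_Info_create(&info);

    char buf[32];
    snprintf(buf, sizeof buf, "%d", stripe_count);
    MPI_Info_set(info, "striping_factor", buf);   /* number of stripes / OSTs */
    snprintf(buf, sizeof buf, "%ld", stripe_size);
    MPI_Info_set(info, "striping_unit", buf);     /* stripe size in bytes */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, fname,
                  MPI_MODE_WRONLY | MPI_MODE_CREATE, info, &fh);
    MPI_Info_free(&info);
    return fh;
}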
Typically you would use custom data types with such file views, or even higher-level I/O libraries such as HDF5 that work on top of MPI I/O, as sketched below. Getting optimal performance will probably require some site-specific tuning.
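For instance, for the matrix case from the question, a subarray datatype lets the file view describe each rank's block of the global matrix directly, so no manual offset arithmetic is needed. A minimal sketch with a made-up file name and dimensions, assuming a block-row distribution where the row count divides evenly by the number of ranks:
#include <mpi.h>
#include <assert.h>
#include <stdlib.h>

#define ROWS 1024   /* made-up global matrix dimensions */
#define COLS 1024

int main(void) {
    MPI_Init(NULL, NULL);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    assert(ROWS % size == 0);

    int local_rows = ROWS / size;      /* block of rows owned by this rank */
    double* local = malloc((size_t)local_rows * COLS * sizeof(double));
    for (int i = 0; i < local_rows * COLS; i++)
        local[i] = rank;               /* dummy data */

    /* Describe where this rank's block sits inside the global matrix. */
    int sizes[2]    = { ROWS, COLS };
    int subsizes[2] = { local_rows, COLS };
    int starts[2]   = { rank * local_rows, 0 };
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "matrix.out",
                  MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
    /* The file view maps each rank's write onto its own block of the matrix. */
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, local, local_rows * COLS, MPI_DOUBLE,
                       MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(local);
    MPI_Finalize();
}
With a column-block or 2D-block decomposition only sizes, subsizes and starts change; the collective write itself stays the same.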
In practice, knowing this much is sufficient. However, referring to the discussion in the other answer by user3666197, a few more nitpicky details:
This code snippet expresses concurrent file writes at a rather high level of abstraction. You get truly parallel I/O by executing this code on an HPC system with a parallel file system. It is absolutely possible that, in a well-tuned configuration, the bytes written by the different ranks follow entirely different paths and end up on different disks on different storage servers - all in parallel. What matters in your code is to express the concurrency - then you apply tuning to make sure it performs well, which means allowing the storage system to execute it efficiently and in parallel.

Q: Is it possible for each processor to write its own part in parallel (at the same time, not one after another)?
No, it is not. Writing is a process of putting atomic-pieces-of information ( bits to tape, characters to a file-abstraction ) in a pure-[SERIAL] fashion.
If in doubt, take and hold 5 pencils in your hand ( I cannot do it, so it is fine to just imagine that one can ) and try to write one word on a paper; due to the "process"-of-writing related circumstances ( a singularity of how we write ), it is impossible to "write" 5 independent, i.e. different, words in this simplified example.
Similarly, in another form of illustration - if you have a typewriter machine ( hope it is not too archaic an image ) - one can get 5 copies of the same pure-[SERIAL] sequence of characters ( thanks to the 4 pieces of carbon-copy paper inserted between those 5 sheets of office paper ), yet none of these copies will differ from the original - so they are not independent ( as they would be in true-[PARALLEL] processes ) but just a set of replicas, which is a productive use of time and resources when producing some paperwork for sending 1 original + 4 copies into some administrative Matrix, but not an example of true-[PARALLEL] writes.
Last but not least, any attempt to use more than one finger at once for typing on a typewriter ( which puts a pure-[SERIAL] sequence of characters onto paper ) will produce a mechanical jam, as the process of mechanical type-writing relies on a singular point where a character is printed, through a hit onto an ink-ribbon, onto the paper.
Modern filesystems are far from this trivial archetype, yet they keep a similar concept of producing and maintaining a pure-[SERIAL] representation of a sequence of characters. Even though it is possible to open more filehandles with access "into" this sequence of characters, that does not mean one has a chance to make the file-I/O operations de-serialised, much less happen at once ( as disk heads are not present at several different locations of the magnetic-disk storage at once ( even less so for a tape device ), and neither do almost-random-access devices, like SSDs et al., go this wild way, where they would lose control of their low-level properties ( wear-levelling, elevator-optimisations, power-limiting and similar low-level device tricks ) ).

-
Your analogy is irrelevant because storage solutions, fortunately, don't have the same limitations that typewriters/fingers/single disks suffer from. Even if we were to follow the analogy, you have 5 writers writing 5 chapters (matrix rows) on 5 typewriters (storage targets of a parallel file system) - in parallel - which are then glued together into a book (file). There you go! – Zulan Sep 11 '19 at 13:51
-
@Zulan All modern F/S try to minimise the blind spot when a content-modifying write has to occur. Some tricks hide it smarter, some less smart. Any such modification *cannot occur in parallel* ~ **Dilemma: which one of the many parallel values to write is to be stored?** Diving deeper into the costs of maintaining cache-coherency across the cluster-wide distributed Lustre/PFS, the actual (hidden from users' sight) ordering of changes of the meta-data, of the content, of the locks and of the cache-coherency is never parallel at the lowest level. **Concurrent?** Yes, but **not parallel** – user3666197 Sep 11 '19 at 14:56
-
The question is not about metadata. It is about data. Specifically, about non-overlapping areas in a single file. For which this scheme applies: https://www.nics.tennessee.edu/files/images/striped-view-A.jpg - Or you could just read the [already suggested lecture](http://wgropp.cs.illinois.edu/courses/cs598-s16/lectures/lecture32.pdf). If you continue to disagree with Prof. Gropp, I suggest you back your claims with credible sources or an argument that doesn't rely on analogies. – Zulan Sep 11 '19 at 15:15
-
Thanks for reminding one of the technology-layer trick ( the striping ). With all respect to all opponents, striping is not a solution, it is a mere strategy ( since its first implementations deep back in the 90s ) to lower the rate of colliding access-requests, to further enjoy the stochastic chance of lowering access-latency ( if complemented with 1:N replicas, some of which might be faster to access than the "local" write ) and to harness distributed resources, thus avoiding "local" bottlenecks in the flow of data to/from the device. Yet, striping has nothing to do with a true-[PARALLEL] write – user3666197 Sep 11 '19 at 15:21
-
Striping permits marshalling a write to take place on a "distributed" resource, yet if all 5 ( claimed-to-be-true-[PARALLEL] ) writers are instructed to put a letter ( a data-content element ) onto page 136, 3rd row, last-but-one character printed on that row, the claim proves itself to be false. These 5 men cannot obey, in a true-[PARALLEL] fashion, such an instruction from "their" own centre of command ( the true-[PARALLEL]-process 1 asking to type and put "there" a letter A, the true-[PARALLEL]-process 2 asking to type a letter B, the true-[PARALLEL]-process 3 asking to type a letter Z ) – user3666197 Sep 11 '19 at 15:29
-
Concurrent access to the same position in a file is not a requirement of the question. – Zulan Sep 11 '19 at 15:33
-
@Zulan Summed up, striping is not the solution, the "collision avoidance" was. So it is not a property of the claimed F/S, but an a priori given warranty ( obtained from a promise ( for which no one says what will happen if the said promise does not get fulfilled under some future state or conditions in the run-time ecosystem ) ) that permits the said striping to "survive" that specifically orchestrated use-case, but **that does not generalise into a solution for true-[PARALLEL] writes.** So, kindly do not try to exchange an effect with a cause in the chain of argumentation. – user3666197 Sep 11 '19 at 15:34