
Greenplum is described as supporting parallel data loading, and I have a question about how it works. I understand that records are read in parallel, but I can't understand how parallel writes are done. Are the parallel writes done on the same database, or on different databases (segments)? Please explain. Thanks.

navin
    -1: this is explained in Admin guide, chapter 12 "Loading and Unloading Data" – mys Nov 22 '12 at 12:32

3 Answers


The parallel writes are done on different segments, with data being fed by one or more instances of gpfdist running on the ETL server(s). I suspect a significant part of the magic is the DISTRIBUTED BY clause, which is used to scatter the rows of a table across the segment servers.
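As a rough sketch of what that looks like in practice (table names, column names, and gpfdist host/port values below are assumptions, not from the question):

```sql
-- Hypothetical table; rows are scattered across segments by a hash of cust_id.
CREATE TABLE sales (
    cust_id   integer,
    amount    numeric,
    sale_date date
) DISTRIBUTED BY (cust_id);

-- Readable external table fed by two gpfdist instances on ETL hosts.
-- Each segment pulls rows directly from gpfdist, so the load runs in parallel.
CREATE EXTERNAL TABLE ext_sales (LIKE sales)
LOCATION ('gpfdist://etl1:8081/sales*.csv',
          'gpfdist://etl2:8081/sales*.csv')
FORMAT 'CSV' (HEADER);

-- The load itself is an ordinary INSERT ... SELECT; the database routes each
-- row to the segment that owns its distribution-key hash.
INSERT INTO sales SELECT * FROM ext_sales;
```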

John Percival Hackworth

Concurrent reads and writes can be done at the segment level with the help of gpfdist or gphdfs.

For example, if you want to unload data to files on disk, you can use a writable external table that points to several gpfdist locations, and each segment will write its data to those destinations in parallel.
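A minimal sketch of such an unload (the table and gpfdist locations are illustrative assumptions):

```sql
-- Hypothetical writable external table: each segment streams its own rows
-- out through one of the gpfdist instances, in parallel.
CREATE WRITABLE EXTERNAL TABLE ext_sales_out (LIKE sales)
LOCATION ('gpfdist://etl1:8081/sales_out.csv',
          'gpfdist://etl2:8081/sales_out.csv')
FORMAT 'CSV';

-- The unload is a plain INSERT into the external table.
INSERT INTO ext_sales_out SELECT * FROM sales;
```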

leonkhu

John is correct.

Each instance of gpfdist, by default, will handle 4 concurrent connections. When loading, each segment with a connection reads its "chunk" of the data and processes it according to the distribution hash of the table.
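To scale out, you typically run several gpfdist instances; a sketch of starting two of them on an ETL host (ports, directories, and log paths here are assumptions):

```
# Each gpfdist process serves files from a directory over HTTP;
# segments connect to it directly and pull their share of the data.
gpfdist -d /data/load -p 8081 -l /var/log/gpfdist_8081.log &
gpfdist -d /data/load -p 8082 -l /var/log/gpfdist_8082.log &
```

Each location listed in an external table's LOCATION clause then maps to one of these instances, so adding instances adds parallel feed points.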

See: https://blog.2ndquadrant.com/parallel_etl_with_greenplum/