
Greenplum is described as supporting parallel data loading, and I have a question about how it works. I understand that records are read in parallel, but I can't understand how parallel writes are done. Are the parallel writes done on the same database, or on different databases (segments)? Please explain. Thanks.

navin
    -1: this is explained in Admin guide, chapter 12 "Loading and Unloading Data" – mys Nov 22 '12 at 12:32

3 Answers


The parallel writes are done on different segments, with data being fed by one or more instances of gpfdist running on the ETL server(s). I suspect a significant part of the magic is the DISTRIBUTED BY clause, which is used to scatter the rows of a table across the segment servers.
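As a rough sketch of what that looks like in practice (table names, column names, and gpfdist host/port values below are assumptions, not from the question):

```sql
-- Hypothetical table; rows are scattered across segments by a hash of cust_id.
CREATE TABLE sales (
    cust_id   integer,
    amount    numeric,
    sale_date date
) DISTRIBUTED BY (cust_id);

-- Readable external table fed by two gpfdist instances on ETL hosts.
-- Each segment pulls rows directly from gpfdist, so the load runs in parallel.
CREATE EXTERNAL TABLE ext_sales (LIKE sales)
LOCATION ('gpfdist://etl1:8081/sales*.csv',
          'gpfdist://etl2:8081/sales*.csv')
FORMAT 'CSV' (HEADER);

-- The load itself is an ordinary INSERT ... SELECT; the database routes each
-- row to the segment that owns its distribution-key hash.
INSERT INTO sales SELECT * FROM ext_sales;
```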

John Percival Hackworth

Concurrent reads and writes can be done at the segment level with the help of gpfdist or gphdfs.

For example, if you want to unload data to files on disk, you can use a writable external table that points to several gpfdist locations, and each segment will write its data to those destinations in parallel.
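A minimal sketch of such an unload (the table and gpfdist locations are illustrative assumptions):

```sql
-- Hypothetical writable external table: each segment streams its own rows
-- out through one of the gpfdist instances, in parallel.
CREATE WRITABLE EXTERNAL TABLE ext_sales_out (LIKE sales)
LOCATION ('gpfdist://etl1:8081/sales_out.csv',
          'gpfdist://etl2:8081/sales_out.csv')
FORMAT 'CSV';

-- The unload is a plain INSERT into the external table.
INSERT INTO ext_sales_out SELECT * FROM sales;
```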

leonkhu

John is correct.

Each instance of gpfdist, by default, will handle 4 concurrent connections. When loading, each segment with a connection reads its "chunk" of the data and processes it according to the distribution hash of the table.
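To scale out, you typically run several gpfdist instances; a sketch of starting two of them on an ETL host (ports, directories, and log paths here are assumptions):

```
# Each gpfdist process serves files from a directory over HTTP;
# segments connect to it directly and pull their share of the data.
gpfdist -d /data/load -p 8081 -l /var/log/gpfdist_8081.log &
gpfdist -d /data/load -p 8082 -l /var/log/gpfdist_8082.log &
```

Each location listed in an external table's LOCATION clause then maps to one of these instances, so adding instances adds parallel feed points.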

See: https://blog.2ndquadrant.com/parallel_etl_with_greenplum/