Could somebody help me with this? I have very large files (CSV format with 5 columns), approx. 500 MB-1 GB each, which I need to insert into a Greenplum database. I use the file source to read these files with the option --mode=lines and the gpfdist sink to import the data into Greenplum, but the speed of this operation is very poor. How can I tune this? I tried changing options such as batchcount, flushcount, flushtime, batchtime, etc., but without luck. With gpload it takes only ~20-30 sec to insert a ~800 MB file.

file --directory=/data --filename-pattern=*.csv --mode=lines --prevent-duplicates=false --markers-json=false | gpfdist --db-user=**** --db-name=**** --column-delimiter=, --mode=insert --gpfdist-port=8000 --db-password=**** --db-host=**** --table=test --flush-count=200 --batch-count=1000000 --batch-period=2

Thanks

  • I'll try to check optional settings for this use case. But first, what binder/transport are you using (rabbit? kafka?), and did you check the transfer rate between source and sink there? – Janne Valkealahti Dec 01 '16 at 09:26
  • rabbit 3.6.5. In the Rabbit UI I only see that the connection is in FLOW state. Here is a picture from the UI: https://s15.postimg.org/kszvjx8bf/Screenshot_from_2016_12_01_12_26_01.png – Berislav Purgar Dec 01 '16 at 11:26
  • You can see individual queue stats from Queues tab. Rabbit is relatively slow and you probably see anything from 1K to 20K msgs/s. This is your first bottleneck. There's also a `throughput` sink which simply logs message rates. – Janne Valkealahti Dec 01 '16 at 11:35
  • file --directory=/data --prevent-duplicates=false --filename-pattern=*.csv --mode=lines | gpfdist --db-user=xxxx --db-name=xxxx --column-delimiter=, --mode=insert --segment-reject-limit=100000 --gpfdist-port=8000 --segment-reject-type=rows --db-password=xxxx --db-host=xxxx --flush-count=200 --batch-count=100000 --table=devices. Transfer rates are https://s16.postimg.org/h188zmthx/Screenshot_from_2016_12_01_14_31_01.png – Berislav Purgar Dec 01 '16 at 13:29
  • OK, I switched to Kafka and created a throughput sink. These are the results: http://pastebin.com/em8CtvBW . If I read them correctly, I get approx. 50k msg/s. But the insert with gpfdist is still slow :( – Berislav Purgar Dec 12 '16 at 08:24
  • Let me try to play with an aggregated application (running sink/source in the same app, bypassing the external bus), which would be kind of like direct binding in XD. This would not help with Dataflow as we speak (some of the needed features are coming with 1.2.x). – Janne Valkealahti Dec 12 '16 at 17:11
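
To put the numbers from the thread side by side, here is a rough back-of-the-envelope sketch in Python. It converts the gpload timing from the question (~800 MB in 20-30 sec) into MB/s and estimates how long the same file would take at the ~50k msg/s measured with the throughput sink. The average line length of 50 bytes is purely an assumption for illustration (the thread does not state it), and the function names are hypothetical.

```python
def mb_per_second(size_mb: float, seconds: float) -> float:
    """Bulk transfer rate in MB/s for a file of size_mb loaded in the given time."""
    return size_mb / seconds


def seconds_at_message_rate(size_mb: float, avg_line_bytes: float,
                            msgs_per_second: float) -> float:
    """Estimated time to push a file through the bus, one message per line."""
    lines = size_mb * 1_000_000 / avg_line_bytes
    return lines / msgs_per_second


# gpload figures from the question: ~800 MB in 20-30 sec.
gpload_low = mb_per_second(800, 30)
gpload_high = mb_per_second(800, 20)

# Stream figures: ~50k msg/s from the throughput sink.
# ASSUMPTION: 50 bytes per CSV line; the real value is not given in the thread.
stream_seconds = seconds_at_message_rate(800, 50, 50_000)

print(f"gpload: {gpload_low:.1f}-{gpload_high:.1f} MB/s")
print(f"stream at 50k msg/s: ~{stream_seconds:.0f} s for the same file")
```

Under that assumed line length, the per-line message rate alone would put the stream roughly an order of magnitude behind gpload before gpfdist does any work, which is consistent with the suggestion to bypass the external bus. With a different line length the estimate scales proportionally.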
