
I am using HDP 2.4.2 and I want to connect Spark with HAWQ for data ingestion.

Please let me know if there is a recommended/correct approach. Currently I am using the PostgreSQL JDBC driver to connect Spark with HAWQ, and I am facing issues like:

- The DataFrame writer automatically creates the table in HAWQ if it is not present.

- Record ingestion is too slow.

- It intermittently shows errors such as "org.postgresql.util.PSQLException: ERROR: relation "table_name" already exists".
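For reference, my write path looks roughly like this (connection details, paths, and the table name are placeholders, not my real settings; Spark 1.6 API, since HDP 2.4.2 ships Spark 1.6):

```scala
import java.util.Properties
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

object HawqJdbcWrite {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HawqIngest"))
    val sqlContext = new SQLContext(sc)

    // Placeholder connection details.
    val url = "jdbc:postgresql://hawq-master:5432/mydb"
    val props = new Properties()
    props.setProperty("user", "gpadmin")
    props.setProperty("driver", "org.postgresql.Driver")

    val df = sqlContext.read.parquet("/data/input")  // hypothetical source

    // SaveMode.Append writes into an existing table instead of trying to
    // create it, which avoids the "relation already exists" error -- but the
    // rows still go through single-row JDBC inserts, hence the slowness.
    df.write.mode(SaveMode.Append).jdbc(url, "table_name", props)
  }
}
```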

nilesh1212

1 Answer


Please see this example Scala project for reading HAWQ data via Spark RDD: https://github.com/kdunn926/sparkHawq

If you are hoping to read data generated by Spark with HAWQ, your best option will be to write to HDFS from Spark and use PXF to read it with HAWQ. See the documentation here: http://hdb.docs.pivotal.io/200/hawq/pxf/PivotalExtensionFrameworkPXF.html
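A minimal sketch of that pattern, assuming Spark 1.6 with the spark-csv package (on Spark 2.x you would use `df.write.csv(...)` instead); the HDFS path, delimiter, PXF host/port, and table layout are all placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkToHawq {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkToHawq"))
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read.parquet("/data/input")  // hypothetical source

    // Write the dataset to HDFS as gzip-compressed, pipe-delimited text.
    df.write
      .format("com.databricks.spark.csv")
      .option("delimiter", "|")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .save("hdfs:///data/staging/my_table")

    // HAWQ side (SQL, shown as comments) -- read the files in parallel
    // via a PXF external table, then insert into the target table:
    //
    //   CREATE EXTERNAL TABLE my_table_ext (id int, name text)
    //     LOCATION ('pxf://namenode:51200/data/staging/my_table?PROFILE=HdfsTextSimple')
    //     FORMAT 'TEXT' (DELIMITER '|');
    //   INSERT INTO my_table SELECT * FROM my_table_ext;
  }
}
```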

Kyle Dunn
  • Thank you Kyle for the answer. What would be the better approach to insert huge spark datasets into HAWQ? – nilesh1212 Oct 18 '16 at 14:03
  • If you want to avoid the intermediate persistence of data into HDFS, I think your best bet is to write the results from Spark in to Kafka and use Spring Cloud Dataflow's `gpfdist` sink module to do batch loading into HAWQ. The simplest solution is to just write the Spark dataset to HDFS as a compressed delimited format and read it in parallel with PXF. – Kyle Dunn Oct 18 '16 at 14:59
  • Kyle, I think Spring Cloud Dataflow would be overkill for this use case. Can't we use JDBC for inserting huge Spark datasets into HAWQ? – nilesh1212 Oct 19 '16 at 14:01
  • JDBC does not support parallel loading, which is typically a requirement for any sizable dataset. You can use JDBC, for sure; however, it will not be as performant as an approach using PXF or gpfdist. If you do decide on JDBC, just make sure it's using the "COPY" mechanism rather than single-transaction inserts. – Kyle Dunn Oct 19 '16 at 14:25
  • Thank you Kyle. So it seems like writing to HDFS from Spark and then using PXF to read it with HAWQ is the appropriate solution for production jobs. – nilesh1212 Oct 20 '16 at 11:31
  • You're welcome. Can you also mark the question answered? – Kyle Dunn Oct 20 '16 at 18:40
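The "COPY" mechanism mentioned in the comments can be driven from the PostgreSQL JDBC driver's `CopyManager`; a hedged sketch, with placeholder connection details and a hypothetical two-column table:

```scala
import java.io.ByteArrayInputStream
import java.sql.DriverManager
import org.postgresql.copy.CopyManager
import org.postgresql.core.BaseConnection

object HawqCopyLoad {
  def main(args: Array[String]): Unit = {
    // Placeholder connection details.
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://hawq-master:5432/mydb", "gpadmin", "")
    try {
      val copier = new CopyManager(conn.asInstanceOf[BaseConnection])

      // Stream delimited rows through a single COPY command instead of
      // issuing per-row INSERT statements.
      val rows = "1|alice\n2|bob\n"
      val loaded = copier.copyIn(
        "COPY table_name FROM STDIN WITH DELIMITER '|'",
        new ByteArrayInputStream(rows.getBytes("UTF-8")))
      println(s"Loaded $loaded rows")
    } finally {
      conn.close()
    }
  }
}
```

From Spark, the same idea can be applied inside `foreachPartition`, serializing each partition to delimited text and pushing it through one COPY per partition; it is still less parallel on the HAWQ side than PXF or gpfdist, as noted above.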