I had the Spark-to-HAWQ JDBC connection working, but two days later there's an issue extracting data from a table. Nothing has changed in the Spark configuration...
Simple step #1 - printing the schema of a simple table in HAWQ. I can create a SQLContext DataFrame and connect to the HAWQ db:
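For context, db_url and db_table are defined roughly like this (the host, port, and database/table names here are placeholders, not the real ones):

db_url = "jdbc:postgresql://hawq-master.example.com:5432/mydb"  # placeholder HAWQ master host/port/db
db_table = "products"  # placeholder name for the three-column table whose schema is shown below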
df = sqlContext.read.format('jdbc').options(url=db_url, dbtable=db_table).load()
df.printSchema()
Which prints:
root
|-- product_no: integer (nullable = true)
|-- name: string (nullable = true)
|-- price: decimal (nullable = true)
But when actually trying to extract data:
df.select("product_no").show()
This error pops up...
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost):
org.postgresql.util.PSQLException: ERROR: could not write 3124 bytes to temporary file: No space left on device (buffile.c:408) (seg33 adnpivhdwapda04.gphd.local:40003 pid=544124) (cdbdisp.c:1571)
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:615)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:465)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:350)
at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:372)
at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:350)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:248)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
Things I've tried (but I'm willing to try again if there are more precise steps):
- Ran 'df -i' on the HAWQ master node; inode utilization is only 1%
- Ran vacuumdb on the HAWQ database (a full VACUUM isn't recommended on HAWQ)
- Tried creating a tiny new database with just this single three-column table; no luck
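For what it's worth, the next diagnostic I plan to try is reading through a subquery so the segments only have to return a handful of rows; if even that trips the temp-file error, it points at segment-side scratch space rather than data volume. (This is just a sketch I haven't run yet; the subquery-as-dbtable pattern and the explicit org.postgresql.Driver class are assumptions on my part.)

small_df = sqlContext.read.format('jdbc').options(
    url=db_url,
    dbtable='(SELECT product_no FROM ' + db_table + ' LIMIT 5) AS t',  # push the LIMIT down to HAWQ
    driver='org.postgresql.Driver'
).load()
small_df.show()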
With a three-column table this tiny it can't be an actual shortage of space, so where and what is tripping this up??