
I had the Spark to HAWQ JDBC connection working, but now two days later there's an issue with extracting data from a table. Nothing's changed on the Spark configuration...

Simple step #1: print the schema of a simple table in HAWQ. I can create a SQLContext DataFrame and connect to the HAWQ database:

df = sqlContext.read.format('jdbc').options(url=db_url, dbtable=db_table).load()
df.printSchema()

Which prints:

root
 |-- product_no: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- price: decimal (nullable = true)

But when actually trying to extract data:

df.select("product_no").show()

this error pops up:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): 
org.postgresql.util.PSQLException: ERROR: could not write 3124 bytes to temporary file: No space left on device (buffile.c:408)  (seg33 adnpivhdwapda04.gphd.local:40003 pid=544124) (cdbdisp.c:1571)
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:615)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:465)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:350)
    at org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:372)
    at org.apache.spark.sql.jdbc.JDBCRDD.compute(JDBCRDD.scala:350)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:248)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
    at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)

Things I've tried (but willing to try again if there are more precise steps):

  • Ran `df -i` on the HAWQ master node; inode utilization is only 1%
  • Ran a dbvacuum on the HAWQ database (VACUUM ALL isn't recommended on HAWQ)
  • Created a tiny new database (a single table with 3 columns), no luck
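One thing worth noting: `df -i` reports inode utilization, not bytes, and the error message names a specific segment host. A byte-level check on that segment (a sketch, assuming shell access to the segment host) may tell a different story:

```shell
# On the segment host named in the error (adnpivhdwapda04 here),
# check byte-level usage of every mounted filesystem; the full
# volume may not be the one the master node reports.
df -h

# /tmp is a common destination for temporary spill files:
df -h /tmp
```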

This can't be an actual disk-space deficiency, so where and what is tripping this up?

  • Could be a permission problem. Please check the postgres log; you are swimming in muddy water. With sunglasses. /ORM – wildplasser Sep 24 '15 at 23:32
  • Please show the full, unmodified output of `df -h` and `mount`, as well as `SHOW temp_tablespaces` from `psql`. Also +1 and thanks for showing full stack trace. – Craig Ringer Sep 25 '15 at 00:07

1 Answer


could not write 3124 bytes to temporary file: No space left on device

The volume used for temp files is filling up. The temp file is then deleted on error, so you don't actually see the full volume in `df` afterwards.

This could be a tmpfs, like /tmp on most Linux systems. If so, it's backed by virtual memory. To confirm, check mount and the setting of PostgreSQL's temp_tablespaces (SHOW temp_tablespaces). If it's blank, PostgreSQL will use the database's default tablespace, which is unlikely to be on a tmpfs; but if it's set, check where that tablespace lives. If it's on a tmpfs you may need to move it.
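A quick way to run those checks from the master (a sketch; it assumes `psql` can connect as an admin user, and that HAWQ's catalog matches stock PostgreSQL here):

```shell
# Where does PostgreSQL write temporary files? Empty output means
# the default tablespace of the database is used.
psql -c "SHOW temp_tablespaces;"

# List tablespace names so you can map the setting to a directory:
psql -c "SELECT spcname FROM pg_tablespace;"

# Is any candidate directory on a memory-backed tmpfs?
mount | grep tmpfs
```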

It could also be somehow filling the main tablespace, but that's exceedingly unlikely if it's only at 1% utilization at the moment. Perhaps a massively runaway recursive CTE could do it, but it's unlikely.
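If you do want to rule out the main tablespace, per-database sizes are easy to check (a sketch; `pg_database_size` and `pg_size_pretty` exist in stock PostgreSQL and Greenplum-derived systems, but verify them on your HAWQ version):

```shell
# Rough per-database footprint, largest first:
psql -c "SELECT datname, pg_size_pretty(pg_database_size(datname))
         FROM pg_database
         ORDER BY pg_database_size(datname) DESC;"
```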

Quota management is also a possibility. Perhaps a filesystem quota is configured?
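Whether quotas are even in play is usually visible from the mount options (a sketch; the full per-user report needs the quota tools installed and root):

```shell
# Quota enforcement shows up as usrquota/grpquota mount options:
mount | grep -E 'usrquota|grpquota' || echo "no quota mount options found"

# If quotas are enabled, report per-user usage (requires root):
# repquota -a
```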

  • You're right @Craig Ringer, thank you! I wasn't seeing the /tmp being filled up. Sure enough, it was at 100% utilization. Don't know what gets written there, but HAWQ can't update tables so it just creates new ones (...and more new ones, who knows if there are any references?). One final note: the general admin account didn't show that 100% utilization, only after running `df` as root did the problem reveal itself. – WaveRider Sep 25 '15 at 19:19