
I am trying to optimize a Spark job using bucketing. I read .parquet and .csv files, do some transformations, then bucket the data and join the two DataFrames. Finally I write the joined DataFrame to parquet, but I get an empty file of ~500 B instead of ~500 MB.

  • Cloudera (cdh5.15.1)
  • Spark 2.3.0
  • Blob

    import org.apache.spark.sql.SaveMode  // needed for SaveMode.Overwrite

    val readParquet = spark.read.parquet(inputP)
    readParquet
        .write
        .format("parquet")
        .bucketBy(23, "column")
        .sortBy("column")
        .mode(SaveMode.Overwrite)
        .saveAsTable("bucketedTable1")
    
    val firstTableDF = spark.table("bucketedTable1")
    
    val readCSV = spark.read.csv(inputCSV)
    readCSV
        .filter(..)
        .orderBy(someColumn)
        .write
        .format("parquet")
        .bucketBy(23, "column")
        .sortBy("column")
        .mode(SaveMode.Overwrite)
        .saveAsTable("bucketedTable2")
    
    val secondTableDF = spark.table("bucketedTable2")
    
    val resultDF = secondTableDF
        .join(firstTableDF, Seq("column"), "fullouter")
        // ... further transformations elided ...
    resultDF
        .coalesce(1)
        .write
        .mode(SaveMode.Overwrite)
        .parquet(output)
    

When I launch the Spark job from the command line over ssh, I get the correct result: a ~500 MB parquet file that I can see using Hive. If I run the same job through an Oozie workflow, I get an empty file (~500 bytes). When I call .show() on my resultDF I can see the data, yet the parquet file is empty.

+-----------+---------------+----------+
|       col1|          col2 |      col3|
+-----------+---------------+----------+
|33601234567|208012345678910|       LOL|
|33601234567|208012345678910|       LOL|
|33601234567|208012345678910|       LOL|

There is no problem writing to parquet when I am not saving the data as a table; it only happens with a DataFrame read back from a table.
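
For comparison, this is roughly the direct write path that works for me when I skip `saveAsTable` entirely (a sketch; the variable names are illustrative, the paths and join column are the same as above):

    // Same flow without bucketed tables: read, transform, join and write directly.
    val parquetDF = spark.read.parquet(inputP)
    val csvDF = spark.read.csv(inputCSV)
        .filter(..)
        .orderBy(someColumn)

    csvDF.join(parquetDF, Seq("column"), "fullouter")
        .coalesce(1)
        .write
        .mode(SaveMode.Overwrite)
        .parquet(output)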

Any suggestions?

Thanks in advance for any thoughts!

Niko
  • Does the empty file have a _SUCCESS file along with it? – Ram Ghadiyaram Jun 19 '19 at 17:30
  • Forget about the existing DataFrame for a moment: in the same Oozie flow, before writing the problematic parquet, write a test DataFrame like this: `spark.sparkContext.parallelize(1 to 4).toDF.coalesce(1).write.mode(SaveMode.Overwrite).parquet(destDir)` and see what happens. I don't think this is a Spark issue; it might be something else. – Ram Ghadiyaram Jun 19 '19 at 17:38
  • @RamGhadiyaram Yes, there is a _SUCCESS file with it. Without bucketing and `.saveAsTable()` it works fine; I was just trying to avoid the shuffle problem. – Niko Jun 20 '19 at 07:10
  • @RamGhadiyaram And your suggestion with `spark.sparkContext.parallelize(1 to 4).toDF.coalesce(1).write.mode(SaveMode.Overwrite).parquet(destDir)` works with Oozie. I can query the data in the table. – Niko Jun 20 '19 at 08:21
  • `Seq("coumn")` should probably be `Seq("column")`; I consider that a typo, everything else looks good to me. – Ram Ghadiyaram Jun 20 '19 at 14:38
  • @RamGhadiyaram No. Even when I save a single DF to a table without `.bucketBy` and `.sortBy`, the same problem occurs: the data can't be read back from the table. – Niko Jun 20 '19 at 14:55
  • Can you do an MSCK REPAIR on the table? – Ram Ghadiyaram Jun 20 '19 at 15:06
  • Bucketing is enabled by default. Spark SQL uses the `spark.sql.sources.bucketing.enabled` configuration property to control whether bucketing is enabled and used for query optimization. Can you check this property? (A quick way to check it is shown after these comments.) – Ram Ghadiyaram Jun 20 '19 at 20:33
  • You can refer to https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-bucketing.html, which covers the same thing you are doing... but note that bucketing is not supported for the `DataFrameWriter.save`, `DataFrameWriter.insertInto` and `DataFrameWriter.jdbc` methods. – Ram Ghadiyaram Jun 20 '19 at 20:36
  • @RamGhadiyaram Bucketing is enabled; I can verify it in the Spark History server. I am using @jaceklaskowski's page, and bucketing is supported for `.saveAsTable()`. It even works when I launch the Spark job from the command line, just not in the Oozie workflow. – Niko Jun 21 '19 at 08:10
  • I believe bucketing is supported from Spark 2.3 onwards. Please verify that the Oozie libraries are the correct version for Spark 2.3. This is the last suggestion I can give. – Ram Ghadiyaram Jun 21 '19 at 14:27
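
Following up on the comment about `spark.sql.sources.bucketing.enabled`: a minimal way to check (and, if needed, set) the property from a live `spark` session, as a sketch:

    // Check whether bucketing is enabled for this session (the default is true).
    println(spark.conf.get("spark.sql.sources.bucketing.enabled"))

    // It can also be set explicitly if it turns out to be disabled.
    spark.conf.set("spark.sql.sources.bucketing.enabled", "true")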

1 Answer


I figured it out: in my case I just had to add the option `.option("path", "/sources/tmp_files_path")`. Now I can use bucketing and there is data in my output files.

    readParquet
        .write
        // Explicit table location: data is written here rather than under the default warehouse dir.
        .option("path", "/sources/tmp_files_path")
        .mode(SaveMode.Overwrite)
        .bucketBy(23, "column")
        .sortBy("column")
        .saveAsTable("bucketedTable1")
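
Presumably this works because an explicit `path` makes `saveAsTable` create an external table backed by files at that location, instead of a managed table under the default warehouse directory. A sketch of the rest of the job with the same option applied to the second table (the second path is made up for illustration; everything else is as in the question):

    readCSV
        .filter(..)
        .orderBy(someColumn)
        .write
        .option("path", "/sources/tmp_files_path2")  // illustrative path for the second table
        .mode(SaveMode.Overwrite)
        .bucketBy(23, "column")
        .sortBy("column")
        .saveAsTable("bucketedTable2")

    // Read both tables back and write the joined result as before.
    spark.table("bucketedTable2")
        .join(spark.table("bucketedTable1"), Seq("column"), "fullouter")
        .coalesce(1)
        .write
        .mode(SaveMode.Overwrite)
        .parquet(output)
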
Niko