
I am trying to optimize a Spark job using bucketing. I read .parquet and .csv files, do some transformations, then bucket the data and join the two DataFrames. Finally I write the joined DataFrame to parquet, but I get an empty file of ~500 B instead of ~500 MB.

  • Cloudera (cdh5.15.1)
  • Spark 2.3.0
  • Blob

    import org.apache.spark.sql.SaveMode  // needed for SaveMode.Overwrite

    val readParquet = spark.read.parquet(inputP)
    readParquet
        .write
        .format("parquet")
        .bucketBy(23, "column")
        .sortBy("column")
        .mode(SaveMode.Overwrite)
        .saveAsTable("bucketedTable1")
    
    val firstTableDF = spark.table("bucketedTable1")
    
    val readCSV = spark.read.csv(inputCSV)
    readCSV
        .filter(..)
        .orderBy(someColumn)
        .write
        .format("parquet")
        .bucketBy(23, "column")
        .sortBy("column")
        .mode(SaveMode.Overwrite)
        .saveAsTable("bucketedTable2")
    
    val secondTableDF = spark.table("bucketedTable2")
    
    val resultDF = secondTableDF
        .join(firstTableDF, Seq("column"), "fullouter")
        // ... further transformations elided ...
    resultDF
        .coalesce(1)
        .write
        .mode(SaveMode.Overwrite)
        .parquet(output)
    

When I launch the Spark job from the command line over ssh, I get the correct result: a ~500 MB parquet file that I can see using Hive. If I run the same job through an Oozie workflow, I get an empty file (~500 bytes). When I call .show() on my resultDF I can see the data, yet the parquet file is empty.

+-----------+---------------+----------+
|       col1|          col2 |      col3|
+-----------+---------------+----------+
|33601234567|208012345678910|       LOL|
|33601234567|208012345678910|       LOL|
|33601234567|208012345678910|       LOL|

There is no problem writing to parquet when I am not saving the data as a table; it only happens with a DataFrame read back from a table.
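
For comparison, this is roughly the direct write path that works for me when I skip `saveAsTable` entirely (a sketch; the variable names are illustrative, the paths and join column are the same as above):

    // Same flow without bucketed tables: read, transform, join and write directly.
    val parquetDF = spark.read.parquet(inputP)
    val csvDF = spark.read.csv(inputCSV)
        .filter(..)
        .orderBy(someColumn)

    csvDF.join(parquetDF, Seq("column"), "fullouter")
        .coalesce(1)
        .write
        .mode(SaveMode.Overwrite)
        .parquet(output)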

Any suggestions?

Thanks in advance for any thoughts!

Niko
  • Does the empty file have a _SUCCESS file along with it? – Ram Ghadiyaram Jun 19 '19 at 17:30
  • Forget about the existing DataFrame for a moment: in the same Oozie flow, before writing the problematic parquet, write a test DataFrame like this: `spark.sparkContext.parallelize(1 to 4).toDF.coalesce(1).write.mode(SaveMode.Overwrite).parquet(destDir)` and see what happens. I don't think this is a Spark issue; it might be something else. – Ram Ghadiyaram Jun 19 '19 at 17:38
  • @RamGhadiyaram Yes, there is a _SUCCESS file with it. Without bucketing and `.saveAsTable()` it works fine; I was just trying to avoid the shuffle problem. – Niko Jun 20 '19 at 07:10
  • @RamGhadiyaram And your suggestion with `spark.sparkContext.parallelize(1 to 4).toDF.coalesce(1).write.mode(SaveMode.Overwrite).parquet(destDir)` works with Oozie. I can query the data in the table. – Niko Jun 20 '19 at 08:21
  • `Seq("coumn")` should probably be `Seq("column")`; I consider that a typo, everything else looks good to me. – Ram Ghadiyaram Jun 20 '19 at 14:38
  • @RamGhadiyaram No. Even when I save a single DF to a table without `.bucketBy` and `.sortBy`, the same problem occurs: the data can't be read back from the table. – Niko Jun 20 '19 at 14:55
  • Can you do an MSCK REPAIR on the table? – Ram Ghadiyaram Jun 20 '19 at 15:06
  • Bucketing is enabled by default. Spark SQL uses the `spark.sql.sources.bucketing.enabled` configuration property to control whether bucketing is enabled and used for query optimization. Can you check this property? (A quick way to check it is shown after these comments.) – Ram Ghadiyaram Jun 20 '19 at 20:33
  • You can refer to https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-bucketing.html, which covers the same thing you are doing... but note that bucketing is not supported for the `DataFrameWriter.save`, `DataFrameWriter.insertInto` and `DataFrameWriter.jdbc` methods. – Ram Ghadiyaram Jun 20 '19 at 20:36
  • @RamGhadiyaram Bucketing is enabled; I can verify it in the Spark History server. I am using @jaceklaskowski's page, and bucketing is supported for `.saveAsTable()`. It even works when I launch the Spark job from the command line, just not in the Oozie workflow. – Niko Jun 21 '19 at 08:10
  • I believe bucketing is supported from Spark 2.3 onwards. Please verify that the Oozie libraries are the correct version for Spark 2.3. This is the last suggestion I can give. – Ram Ghadiyaram Jun 21 '19 at 14:27
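
Following up on the comment about `spark.sql.sources.bucketing.enabled`: a minimal way to check (and, if needed, set) the property from a live `spark` session, as a sketch:

    // Check whether bucketing is enabled for this session (the default is true).
    println(spark.conf.get("spark.sql.sources.bucketing.enabled"))

    // It can also be set explicitly if it turns out to be disabled.
    spark.conf.set("spark.sql.sources.bucketing.enabled", "true")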

1 Answer


I figured it out: in my case I just had to add the option `.option("path", "/sources/tmp_files_path")`. Now I can use bucketing and there is data in my output files.

    readParquet
        .write
        // Explicit table location: data is written here rather than under the default warehouse dir.
        .option("path", "/sources/tmp_files_path")
        .mode(SaveMode.Overwrite)
        .bucketBy(23, "column")
        .sortBy("column")
        .saveAsTable("bucketedTable1")
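
Presumably this works because an explicit `path` makes `saveAsTable` create an external table backed by files at that location, instead of a managed table under the default warehouse directory. A sketch of the rest of the job with the same option applied to the second table (the second path is made up for illustration; everything else is as in the question):

    readCSV
        .filter(..)
        .orderBy(someColumn)
        .write
        .option("path", "/sources/tmp_files_path2")  // illustrative path for the second table
        .mode(SaveMode.Overwrite)
        .bucketBy(23, "column")
        .sortBy("column")
        .saveAsTable("bucketedTable2")

    // Read both tables back and write the joined result as before.
    spark.table("bucketedTable2")
        .join(spark.table("bucketedTable1"), Seq("column"), "fullouter")
        .coalesce(1)
        .write
        .mode(SaveMode.Overwrite)
        .parquet(output)
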
Niko