
This seems related to

How to change hdfs block size in pyspark?

I can successfully change the HDFS block size with rdd.saveAsTextFile, but the corresponding DataFrame.write.parquet ignores the setting, so I am unable to save parquet output with the block size I want.

I am not sure whether this is a bug in the pyspark DataFrame writer or whether I simply did not set the configuration correctly.

The following is my testing code:

##########
# init
##########
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

import hdfs
from hdfs import InsecureClient
import os

import numpy as np
import pandas as pd
import logging

os.environ['SPARK_HOME'] = '/opt/spark-2.2.1-bin-hadoop2.7'

block_size = 512 * 1024

conf = SparkConf().setAppName("myapp").setMaster("spark://spark1:7077") \
    .set('spark.cores.max', 20) \
    .set("spark.executor.cores", 10) \
    .set("spark.executor.memory", "10g") \
    .set("spark.hadoop.dfs.blocksize", str(block_size)) \
    .set("spark.hadoop.dfs.block.size", str(block_size))

spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.blocksize", block_size)
spark.sparkContext._jsc.hadoopConfiguration().setInt("dfs.block.size", block_size)

##########
# main
##########

# create DataFrame
df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"}, {'temp': "!"}])

# save using DataFrameWriter, resulting in a 128MB block size

df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')

# save using rdd, resulting in a 512k block size
client = InsecureClient('http://spark1:50070')
client.delete('/tmp/temp_with_rrd', recursive=True)
df_txt.rdd.saveAsTextFile('hdfs://spark1/tmp/temp_with_rrd')
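
To see the difference directly, here is a minimal sketch (assuming the same WebHDFS endpoint as above) that prints the block size HDFS recorded for each output file, using the hdfs client created above:

# print the HDFS block size recorded for each file in an output directory
def print_block_sizes(client, path):
    for name in client.list(path):
        status = client.status(path + '/' + name)
        if status['type'] == 'FILE':
            print(name, status['blockSize'])

print_block_sizes(client, '/tmp/temp_with_df')   # parquet output: 134217728 (128MB)
print_block_sizes(client, '/tmp/temp_with_rrd')  # rdd output: 524288 (512k)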
  • AFAIK Spark SQL stopped using Hadoop configuration in 2.0 – zero323 Mar 14 '18 at 14:28
  • @user69 How would it read from HDFS or use YARN?? The hadoop configuration is within the context of the session – OneCricketeer Mar 15 '18 at 05:46
  • It looks like it's parquet-specific issue. I can successfully save with 512k block-size with df.write.csv() and df.write.text() http://apache-spark-developers-list.1001551.n3.nabble.com/pyspark-DataFrameWriter-ignores-customized-settings-td23584.html – chhsiao1981 Mar 16 '18 at 18:16

2 Answers


I found the answer at the following link:

https://forums.databricks.com/questions/918/how-to-set-size-of-parquet-output-files.html

I can successfully set the parquet block size with spark.hadoop.parquet.block.size.

The following is the sample code:

# init
block_size = 512 * 1024 

conf = SparkConf().setAppName("myapp").setMaster("spark://spark1:7077") \
    .set('spark.cores.max', 20) \
    .set("spark.executor.cores", 10) \
    .set("spark.executor.memory", "10g") \
    .set('spark.hadoop.parquet.block.size', str(block_size)) \
    .set("spark.hadoop.dfs.blocksize", str(block_size)) \
    .set("spark.hadoop.dfs.block.size", str(block_size)) \
    .set("spark.hadoop.dfs.namenode.fs-limits.min-block-size", str(131072))

sc = SparkContext(conf=conf) 
spark = SparkSession(sc) 

# create DataFrame 
df_txt = spark.createDataFrame([{'temp': "hello"}, {'temp': "world"}, {'temp': "!"}]) 

# save using DataFrameWriter, resulting in a 512k block size

df_txt.write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_with_df')

# save using DataFrameWriter.csv, resulting in a 512k block size
df_txt.write.mode('overwrite').csv('hdfs://spark1/tmp/temp_with_df_csv') 

# save using DataFrameWriter.text, resulting in a 512k block size

df_txt.write.mode('overwrite').text('hdfs://spark1/tmp/temp_with_df_text')

# save using rdd, resulting in a 512k block size
client = InsecureClient('http://spark1:50070') 
client.delete('/tmp/temp_with_rrd', recursive=True) 
df_txt.rdd.saveAsTextFile('hdfs://spark1/tmp/temp_with_rrd')
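
As a quick check (again assuming the WebHDFS endpoint on spark1), the block size recorded for one of the parquet part files can be read back through the hdfs client:

# confirm the parquet output now carries the 512k block size
part_files = [f for f in client.list('/tmp/temp_with_df') if f.startswith('part-')]
print(client.status('/tmp/temp_with_df/' + part_files[0])['blockSize'])  # expect 524288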

Hadoop and Spark are independent tools, each with its own strategy for handling data. Spark and Parquet work in terms of data partitions, and the HDFS block size is not meaningful to them. Let Spark write its output the way it wants, and then rearrange the files inside HDFS however you like afterwards.

You can change the number of Parquet output partitions with:

df_txt.repartition(6).write.format("parquet").save("hdfs://...")
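
As an illustration, here is a minimal sketch (the output path /tmp/temp_repartitioned and the WebHDFS client are assumptions carried over from the question) showing that the repartition count drives how many part files Spark writes:

from hdfs import InsecureClient

# write with an explicit partition count, then list the resulting part files
df_txt.repartition(6).write.mode('overwrite').format('parquet').save('hdfs://spark1/tmp/temp_repartitioned')

client = InsecureClient('http://spark1:50070')  # WebHDFS client, as in the question
parts = [f for f in client.list('/tmp/temp_repartitioned') if f.startswith('part-')]
print(parts)  # roughly one part file per partition (empty partitions may be skipped, depending on the Spark version)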
  • I think what you mean to say is that HDFS has an independent configuration cluster-wide that's not configured per Spark application – OneCricketeer Mar 15 '18 at 05:48