
Hope you are all doing fine. I'm reading files from a directory using Structured Streaming:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("RowNo", StringType()),
    StructField("InvoiceNo", StringType()),
    StructField("StockCode", StringType()),
    StructField("Description", StringType()),
    StructField("Quantity", StringType()),
    StructField("InvoiceDate", StringType()),
    StructField("UnitPrice", StringType()),
    StructField("CustomerId", StringType()),
    StructField("Country", StringType()),
    StructField("InvoiceTimestamp", StringType())
])

data = (spark.readStream
        .format("orc")
        .schema(schema)
        .option("header", "true")
        .option("path", "<path_here>")
        .load())

After applying some transformations, I'd like to write the output files with a size of roughly 100 MB each.
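The write side is a standard file-sink writeStream; a sketch of what it looks like is below (the transformed DataFrame name, output path, and checkpoint path are placeholders, not the actual values):

transformed_df = ...  # result of the transformations mentioned above

query = (transformed_df.writeStream
         .format("orc")
         .option("path", "<output_path_here>")
         .option("checkpointLocation", "<checkpoint_path_here>")
         .start())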

Jacek Laskowski
Abdul Haseeb
  • You are reading files and writing files. Why Spark Structured Streaming? Does it act like a file watcher? – Jaison Jul 13 '20 at 06:17

1 Answer


You should override the default HDFS block size.

# 100 MB expressed in bytes; the Hadoop configuration expects the value as a string
block_size = str(1024 * 1024 * 100)

# Apply the setting on the underlying Hadoop configuration
sc._jsc.hadoopConfiguration().set("dfs.block.size", block_size)

Reference: How to change hdfs block size in pyspark?
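To make this concrete, here is a sketch of applying the override on the question's `spark` session before the streaming write starts, plus a check that the value took effect (`spark.sparkContext` is the same context as the `sc` variable used above; the expected printed value is an assumption based on the 100 MB setting):

# Set the HDFS block size to 100 MB before any streaming output is written
block_size = str(1024 * 1024 * 100)
spark.sparkContext._jsc.hadoopConfiguration().set("dfs.block.size", block_size)

# Verify the setting took effect
print(spark.sparkContext._jsc.hadoopConfiguration().get("dfs.block.size"))  # 104857600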

pissall