
Hope you are all doing fine. I'm reading files from a directory using Structured Streaming:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("RowNo", StringType()),
    StructField("InvoiceNo", StringType()),
    StructField("StockCode", StringType()),
    StructField("Description", StringType()),
    StructField("Quantity", StringType()),
    StructField("InvoiceDate", StringType()),
    StructField("UnitPrice", StringType()),
    StructField("CustomerId", StringType()),
    StructField("Country", StringType()),
    StructField("InvoiceTimestamp", StringType())
])

data = (spark.readStream
        .format("orc")
        .schema(schema)
        .option("header", "true")
        .option("path", "<path_here>")
        .load())

After applying some transformations, I'd like to write the output files with a size of roughly 100 MB each.
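The write side is a standard file-sink writeStream; a sketch of what it looks like is below (the transformed DataFrame name, output path, and checkpoint path are placeholders, not the actual values):

transformed_df = ...  # result of the transformations mentioned above

query = (transformed_df.writeStream
         .format("orc")
         .option("path", "<output_path_here>")
         .option("checkpointLocation", "<checkpoint_path_here>")
         .start())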

Jacek Laskowski
Abdul Haseeb
  • You are reading files and writing files. Why Spark Structured Streaming? Does it act like a file watcher? – Jaison Jul 13 '20 at 06:17

1 Answer


You should override the default HDFS block size.

# 100 MB expressed in bytes; the Hadoop configuration expects the value as a string
block_size = str(1024 * 1024 * 100)

# Apply the setting on the underlying Hadoop configuration
sc._jsc.hadoopConfiguration().set("dfs.block.size", block_size)

Reference: How to change hdfs block size in pyspark?
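To make this concrete, here is a sketch of applying the override on the question's `spark` session before the streaming write starts, plus a check that the value took effect (`spark.sparkContext` is the same context as the `sc` variable used above; the expected printed value is an assumption based on the 100 MB setting):

# Set the HDFS block size to 100 MB before any streaming output is written
block_size = str(1024 * 1024 * 100)
spark.sparkContext._jsc.hadoopConfiguration().set("dfs.block.size", block_size)

# Verify the setting took effect
print(spark.sparkContext._jsc.hadoopConfiguration().get("dfs.block.size"))  # 104857600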

pissall