15

I have a scenario in my project where I am reading Kafka topic messages using spark-sql-2.4.1. I am able to process the data using structured streaming. Once the data is received and processed, I need to save it into the respective Parquet files in the HDFS store.

I am able to store and read the Parquet files, and I have set a trigger time of 15 seconds to 1 minute. These files are very small in size, hence resulting in many files.

These Parquet files need to be read later by Hive queries.

So, 1) Does this strategy work in a production environment? Or does it lead to a small-file problem later?

2) What are the best practices (i.e. the industry standard) to handle/design this kind of scenario?

3) How are these kinds of things generally handled in production?

Thank you.

– BdEngineer

4 Answers

6

I know this question is old. I had a similar problem and used Spark structured streaming query listeners to solve it.

My use case is fetching data from Kafka and storing it in HDFS with year, month, day and hour partitions.

The code below takes the previous hour's partition data, repartitions it, and overwrites the data in the existing partition.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.joda.time.{DateTime, DateTimeZone}

// `Config` / `config` below are the application's own configuration objects;
// they supply the HDFS location, the write format, etc.
val session = SparkSession.builder().master("local[2]").enableHiveSupport().getOrCreate()
session.streams.addListener(AppListener(config, session))

class AppListener(config: Config, spark: SparkSession) extends StreamingQueryListener {
  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {}
  override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {
    // Fired after every micro-batch; merge the previous hour's files if it is due.
    this.synchronized { AppListener.mergeFiles(event.progress.timestamp, spark, config) }
  }
  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = {}
}

object AppListener {

  def mergeFiles(currentTs: String, spark: SparkSession, config: Config): Unit = {
    val configs = config.kafka(config.key.get)
    // The progress timestamp is an ISO-8601 string; only merge once we are
    // safely (5 minutes) past the hour tracked in Processed.ts.
    if (DateTime.parse(currentTs).isAfter(Processed.ts.plusMinutes(5))) {

      println(
        s"""
           |Current Timestamp     :     ${currentTs}
           |Merge Files           :     ${Processed.ts.minusHours(1)}
           |""".stripMargin)

      val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
      val ts = Processed.ts.minusHours(1)
      val hdfsPath = s"${configs.hdfsLocation}/year=${ts.getYear}/month=${ts.getMonthOfYear}/day=${ts.getDayOfMonth}/hour=${ts.getHourOfDay}"
      val path = new Path(hdfsPath)

      if (fs.exists(path)) {

        // List the data files in the previous hour's partition (skip _SUCCESS markers).
        val hdfsFiles = fs.listStatus(path)
          .filter(status => status.isFile && !status.getPath.getName.contains("_SUCCESS"))
          .map(_.getPath)
          .toList

        println(
          s"""
             |Total files in HDFS location  : ${hdfsFiles.length}
             |""".stripMargin)

        if (hdfsFiles.length > 1) {

          println(
            s"""
               |Merge Small Files
               |==============================================
               |HDFS Path             : ${hdfsPath}
               |Total Available files : ${hdfsFiles.length}
               |Status                : Running
               |""".stripMargin)

          // Read the partition, collapse it into a single file in a temporary
          // location, then overwrite the original partition from that copy.
          val df = spark.read.format(configs.writeFormat).load(hdfsPath).cache()
          df.repartition(1)
            .write
            .format(configs.writeFormat)
            .mode("overwrite")
            .save(s"/tmp${hdfsPath}")
          df.unpersist()

          spark.read
            .format(configs.writeFormat)
            .load(s"/tmp${hdfsPath}")
            .write
            .format(configs.writeFormat)
            .mode("overwrite")
            .save(hdfsPath)

          // Advance the merge pointer to the next hour boundary.
          Processed.ts = Processed.ts.plusHours(1).hourOfDay().roundFloorCopy()

          println(
            s"""
               |Merge Small Files
               |==============================================
               |HDFS Path             : ${hdfsPath}
               |Total files           : ${hdfsFiles.length}
               |Status                : Completed
               |""".stripMargin)
        }
      }
    }
  }

  def apply(config: Config, spark: SparkSession): AppListener = new AppListener(config, spark)
}

object Processed {
  // Start of the hour currently being written; anything older gets merged.
  var ts: DateTime = DateTime.now(DateTimeZone.UTC).hourOfDay().roundFloorCopy()
}

Sometimes the data is huge, and I have divided it into multiple files using the logic below. The file size will be around ~160 MB.

val bytes = spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
val dataSize = bytes.toLong
val numPartitions = (bytes.toLong./(1024.0)./(1024.0)./(10240)).ceil.toInt

df.repartition(if (numPartitions == 0) 1 else numPartitions)
  .[...]
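
For reference, here is a self-contained sketch of the same sizing idea, with the target size per output file pulled out as an explicit parameter. The helper name and the output path are hypothetical, and depending on the Spark version the plan statistics are read either with stats(spark.sessionState.conf) (as above) or with the parameterless stats (as noted in the comments below).

import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper: repartition so that each output file corresponds to
// roughly `targetEstimatedMB` of Spark's size estimate. The estimate is usually
// much larger than the compressed on-disk size, so aim well above the on-disk
// size you actually want (the divisor 10240 above plays the same role).
def writeWithTargetFileSize(df: DataFrame, outputPath: String, targetEstimatedMB: Double): Unit = {
  val estimatedMB = df.queryExecution.optimizedPlan.stats.sizeInBytes.toLong / 1024.0 / 1024.0 // Spark 2.3+
  val numPartitions = math.max(1, math.ceil(estimatedMB / targetEstimatedMB).toInt)

  df.repartition(numPartitions)
    .write
    .mode(SaveMode.Overwrite)
    .parquet(outputPath)
}

// e.g. writeWithTargetFileSize(df, "/tmp/compacted", targetEstimatedMB = 10240)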

Edit-1

Using this - spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes - we can get the size of the actual DataFrame once it is loaded into memory. For example, you can check the code below.

scala> val df = spark.read.format("orc").load("/tmp/srinivas/")
df: org.apache.spark.sql.DataFrame = [channelGrouping: string, clientId: string ... 75 more fields]

scala> import org.apache.commons.io.FileUtils
import org.apache.commons.io.FileUtils

scala> val bytes = spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
bytes: BigInt = 763275709

scala> FileUtils.byteCountToDisplaySize(bytes.toLong)
res5: String = 727 MB

scala> import sys.process._
import sys.process._

scala> "hdfs dfs -ls -h /tmp/srinivas/".!
Found 2 items
-rw-r-----   3 svcmxns hdfs          0 2020-04-20 01:46 /tmp/srinivas/_SUCCESS
-rw-r-----   3 svcmxns hdfs    727.4 M 2020-04-20 01:46 /tmp/srinivas/part-00000-9d0b72ea-f617-4092-ae27-d36400c17917-c000.snappy.orc
res6: Int = 0

– Srinivas
  • We can use that to estimate the size of the data in a DataFrame. Please note that sometimes this will give you a higher size than the actual one. I have updated this in the answer. Please check. – Srinivas Apr 20 '20 at 02:09
  • I would like to replicate the line:- ```spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes``` in pyspark. Any suggestions? – Bob Jun 05 '20 at 15:39
  • which version of spark are you using ?? – Srinivas Jun 05 '20 at 15:41
  • I am using version 2.3 of Spark, with Python 2.7 – Bob Jun 06 '20 at 16:38
  • for spark 2.3 - val bytes = spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats.sizeInBytes more details check my other post here - https://stackoverflow.com/questions/61338374/how-to-calculate-size-of-dataframe-in-spark-scala/61338455#61338455 – Srinivas Jun 06 '20 at 16:46
  • This doesn't work in Pyspark. When using pyspark I get the following error:- ```>>> bytes = spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats.sizeInBytes Traceback (most recent call last): File "", line 1, in AttributeError: 'SparkSession' object has no attribute 'sessionState' >>> ``` – Bob Jun 07 '20 at 19:01
5

We had a similar problem, too. After a lot of Googling, it seemed the generally accepted way was to write another job that every so often aggregates the many small files and writes them elsewhere in larger, consolidated files. This is what we now do.

As an aside: there is a limit to what you can do here anyway as the more parallelism you have, the greater the number of files because each executor thread writes to its own file. They never write to a shared file. This appears to be the nature of the beast that is parallel processing.
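
A minimal sketch of such a compaction job (the paths, the app name and the coalesce factor here are hypothetical; the idea is simply to read the many small files, reduce the number of partitions, and write larger consolidated files elsewhere):

import org.apache.spark.sql.{SaveMode, SparkSession}

object CompactSmallFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-small-files")
      .getOrCreate()

    // Hypothetical locations: the streaming sink's output, and the consolidated
    // copy that Hive queries should point at.
    val inputPath  = "/data/streaming/events"
    val outputPath = "/data/compacted/events"

    // Read all the small files, reduce the number of output partitions,
    // and write fewer, larger files.
    spark.read.parquet(inputPath)
      .coalesce(8) // pick a value that yields files of a sensible size (e.g. 128-256 MB)
      .write
      .mode(SaveMode.Overwrite)
      .parquet(outputPath)

    spark.stop()
  }
}

Such a job can then be scheduled (e.g. hourly) to run alongside the streaming query.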

– Fenris
  • Thank you for the quick reply, but how do I aggregate these into bigger files? Any strategy? Can you share a code snippet for how to aggregate these files? – BdEngineer Jun 10 '19 at 10:35
  • Just read them in and write them out to a new directory - it really is as simple as that. Note that the written files will also reflect the parallelism I mention in my answer. – Fenris Jun 10 '19 at 11:01
  • How do we need to partitionBy after reading it again? Any better strategy for production? We can't go by any column, as the data is still streaming. – BdEngineer Jun 10 '19 at 11:04
  • Why do you need to partitionBy again? You read the data in and how it is written out reflects how it was in-memory in Spark not how it was originally in HDFS. – Fenris Jun 10 '19 at 11:07
  • My architecture is something like this: source --> Kafka topic --> 1. write parquet & 2. process as structured stream --> 1. write processed data into parquet & 2. Cassandra DB. In this flow the support team wants to query the data using Hive, i.e. parquet. How do I handle this if it has many small files? Hive reads partition by partition, right? – BdEngineer Jun 10 '19 at 17:48
  • But how do I aggregate these into bigger files? Should I consider the size of each file, the number of files, or some batch size? What is the production-grade solution/way? – BdEngineer Jun 11 '19 at 06:36
1

This is a common burning question in Spark streaming, with no fixed answer. I took an unconventional approach based on the idea of append. Since you are using Spark 2.4.1, this solution will be helpful.

So, if appending to an existing file were supported in columnar file formats like Parquet or ORC, it would have been much easier: the new data could be appended to the same file, and the file would get bigger and bigger after every micro-batch. However, as that is not supported, I took a versioning approach to achieve this. After every micro-batch, the data is produced under a version partition, e.g.

/prod/mobility/cdr_data/date=01-01-2010/version=12345/file1.parquet
/prod/mobility/cdr_data/date=01-01-2010/version=23456/file1.parquet

What we can do is: in every micro-batch, read the old version's data, union it with the new streaming data, and write it again to the same path under a new version. Then delete the old versions. This way, after every micro-batch there is a single version and a single file in every partition, and the file in each partition keeps growing.

As a union of a streaming Dataset and a static Dataset isn't allowed, we can use the foreachBatch sink (available in Spark >= 2.4.0), which exposes each micro-batch as a static Dataset.

I have described how to achieve this optimally in the linked post; you might want to have a look: https://medium.com/@kumar.rahul.nitk/solving-small-file-problem-in-spark-structured-streaming-a-versioning-approach-73a0153a0a
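
A minimal sketch of this foreachBatch versioning idea for a single, hypothetical date partition (the Kafka broker, topic, paths and checkpoint location are placeholders; the linked post covers multiple date partitions and safer cleanup):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder().appName("versioned-sink").getOrCreate()

// Hypothetical Kafka source.
val streamingDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "cdr_topic")
  .load()

// A single (hypothetical) date partition directory.
val datePath = "/prod/mobility/cdr_data/date=01-01-2010"

streamingDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    val session = batchDF.sparkSession
    val fs = FileSystem.get(session.sparkContext.hadoopConfiguration)

    // Existing version directories under the date partition, if any.
    val oldVersions =
      if (fs.exists(new Path(datePath)))
        fs.listStatus(new Path(datePath))
          .filter(_.getPath.getName.startsWith("version="))
          .map(_.getPath.toString)
      else Array.empty[String]

    // Inside foreachBatch the micro-batch is a static DataFrame, so it can be
    // unioned with a static read of the previously written version.
    val merged =
      if (oldVersions.nonEmpty) session.read.parquet(oldVersions: _*).unionByName(batchDF)
      else batchDF

    // Rewrite everything as a single file under a fresh version directory,
    // then drop the old versions.
    merged.repartition(1)
      .write
      .mode(SaveMode.Overwrite)
      .parquet(s"$datePath/version=${System.currentTimeMillis()}")

    oldVersions.foreach(p => fs.delete(new Path(p), true))
  }
  .option("checkpointLocation", "/tmp/checkpoints/versioned-sink")
  .start()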

– rk.the1
  • append appears to be supported for parquet [here are some examples](https://delta.io/blog/2022-11-01-pyspark-save-mode-append-overwrite-error/) – ecoe May 24 '23 at 15:08
0

You can set a trigger.

import org.apache.spark.sql.streaming.Trigger

// Each micro-batch writes its output as Parquet; the trigger interval controls
// how often (and therefore how many) files are produced.
df.writeStream
  .format("parquet")
  .option("checkpointLocation", "path/to/checkpoint/dir")
  .option("path", "path/to/destination/dir")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()

The larger the trigger interval, the larger the files. Alternatively, you could run the job with a scheduler (e.g. Airflow) and a trigger of Trigger.Once(), or better Trigger.AvailableNow() on Spark versions that provide it. The job then runs only once per period and processes all available data, producing files of an appropriate size.
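
As a sketch, the same sink driven by a one-shot trigger for scheduler-based runs (the paths are the same placeholders as above; Trigger.AvailableNow() can replace Trigger.Once() on Spark versions that have it):

import org.apache.spark.sql.streaming.Trigger

// One run per scheduler invocation (e.g. an hourly Airflow task): process
// everything that has accumulated since the last checkpoint, then stop.
df.writeStream
  .format("parquet")
  .option("checkpointLocation", "path/to/checkpoint/dir")
  .option("path", "path/to/destination/dir")
  .trigger(Trigger.Once())
  .start()
  .awaitTermination() // block until the single run finishes, then the job exits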

– Eugene Lopatkin