
I have a problem where I need to create an external table in Databricks for each CSV file that lands in an ADLS Gen2 storage account.

I thought about a solution where I would get a streaming DataFrame from the dbutils.fs.ls() output and then call a function that creates a table inside foreachBatch().

I have the function ready, but I can't figure out a way to stream directory information into a streaming DataFrame. Does anyone have an idea how this could be achieved?
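For reference, one way to obtain such a stream of file arrivals without polling dbutils.fs.ls() is Databricks Auto Loader (the cloudFiles source), which discovers new files incrementally and exposes each originating path via input_file_name(). Below is a minimal sketch of the read side only, assuming a Databricks notebook where spark is already defined; the landing path is a placeholder, and the files are read as plain text because only the paths matter here, not the content.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.input_file_name

// Placeholder path -- point this at the ADLS Gen2 landing container.
val landingPath = "abfss://<container>@<account>.dfs.core.windows.net/landing"

// Auto Loader ("cloudFiles", Databricks-specific) tracks which files it has
// already seen, so each new file is picked up exactly once. Reading as text
// with a fixed schema avoids schema inference across unrelated CSV layouts.
val filesDf: DataFrame = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "text")
  .schema("value STRING")
  .load(landingPath)
  .withColumn("source_file", input_file_name())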

LeandroHumb
  • do you really need to have a separate table for each individual CSV file? – Alex Ott Mar 30 '22 at 18:12
  • Yes, because they are completely different files. For example, they will upload a file with users and another file with cars; then I need to register a table called users and another table called cars. – LeandroHumb Mar 31 '22 at 11:23

1 Answer


Kindly check the code block below.

package com.sparkbyexamples.spark.streaming
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SparkStreamingFromDirectory {

  def main(args: Array[String]): Unit = {

    val spark:SparkSession = SparkSession.builder()
      .master("local[3]")
      .appName("SparkByExamples")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    // Schema for the incoming files (only the fields you need)
    val schema = StructType(
      List(
        StructField("Zipcode", IntegerType, true)
      )
    )

    // Stream files from the source directory as they arrive
    val df = spark.readStream
      .schema(schema)
      .json("Your directory")

    df.printSchema()

    val groupDF = df.select("Zipcode")
        .groupBy("Zipcode").count()
    groupDF.printSchema()

    // Write the running counts per Zipcode to the console
    groupDF.writeStream
      .format("console")
      .outputMode("complete")
      .start()
      .awaitTermination()
  }
}
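The example above streams JSON and writes aggregated counts to the console. For the scenario in the question (arbitrary CSV files, one external table per file), the console sink can be swapped for foreachBatch, with the stream used only to detect new file paths; the table itself is created from the file's own location, since each file may have completely different columns. A rough sketch under those assumptions, where createExternalTable stands in for the helper function the asker already has (name assumed), and paths are placeholders:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.input_file_name

// Runs once per micro-batch; registers a table for every new file seen.
def registerNewFiles(batch: DataFrame, batchId: Long): Unit = {
  batch.select("source_file").distinct().collect().foreach { row =>
    createExternalTable(row.getString(0))   // the asker's existing helper (name assumed)
  }
}

val arrivals = spark.readStream
  .text("Your directory")                   // content is ignored; only the paths matter
  .withColumn("source_file", input_file_name())

arrivals.writeStream
  .option("checkpointLocation", "/tmp/landing-checkpoint")   // placeholder path
  .foreachBatch(registerNewFiles _)
  .start()
  .awaitTermination()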
Sairam Tadepalli