
I am facing an issue while writing data to a Delta location: the data I get back is incorrect. I am using a Python notebook in Azure Databricks. Dataset used: /databricks-datasets/flights/

Below are the steps I performed.

Mount the blob storage container.

dbutils.fs.mount(
    source = "wasbs://flight@storageaccount7830.blob.core.windows.net",
    mount_point = "/mnt/blobstorage",
    extra_configs = {'fs.azure.account.key.storageaccount7830.blob.core.windows.net':'accountKey'}
)

Save the data as a Delta table.

fulldf = (spark.read.format("csv")
          .option("header", True)
          .option("inferSchema", True)
          .load("/databricks-datasets/flights/"))
 
fulldf.write.format("delta").mode("overwrite").save("/mnt/blobstorage/Full")

Take the first 10 rows and save them to a second Delta location.

df=fulldf.limit(10)
df.write.format("delta").mode("overwrite").save("/mnt/blobstorage/Small")

Display the data. I am getting output like this:

%sql
select * from delta.`/mnt/blobstorage/Small`
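For context, here is a minimal plain-Python sketch (using a made-up sample row, not the actual dataset) of what can go wrong here: if the files are tab-separated but parsed with the default comma delimiter, each whole line collapses into a single field, which produces exactly this kind of garbled output.

```python
import csv
import io

# Hypothetical tab-separated line in the shape of the flights data
# (date, delay, distance, origin, destination) -- illustrative only.
line = "01011245\t6\t602\tABE\tATL\n"

# Parsed with the default comma delimiter: everything lands in one field.
comma_rows = list(csv.reader(io.StringIO(line)))

# Parsed with the correct tab delimiter: five separate fields.
tab_rows = list(csv.reader(io.StringIO(line), delimiter="\t"))

print(len(comma_rows[0]))  # 1 -- one garbled column
print(len(tab_rows[0]))    # 5 -- the expected five columns
```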

(screenshot of the incorrect output)

The output should be like this:

(screenshot of the expected output)


1 Answer


The right thing is to use an explicit schema instead of relying on inference. Also, the CSV files in the load directory appear to be positional (no header row) and tab-delimited, so you need to pass the options like this:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("date", StringType(), True),
    StructField("delay", StringType(), True),
    StructField("distance", StringType(), True),
    StructField("origin", StringType(), True),
    StructField("destination", StringType(), True),
])
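To see how a positional schema like this maps onto a header-less row, here is a plain-Python illustration (the sample values are made up): the field names are paired with the split values strictly by position, first value to first field, and so on.

```python
# Field names in the same order as the StructType above.
field_names = ["date", "delay", "distance", "origin", "destination"]

# Hypothetical tab-separated row with no header line.
row = "01011245\t6\t602\tABE\tATL".split("\t")

# Positional mapping: order in the schema decides which column a value lands in.
record = dict(zip(field_names, row))
print(record["origin"])       # ABE
print(record["destination"])  # ATL
```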

fulldf = (spark.read
          .format("csv")
          .option("delimiter", "\t")
          .option("header", False)   # the files have no header row
          .schema(schema)
          .load("/databricks-datasets/flights/"))

fulldf.limit(10).write.format("delta").mode("overwrite").save("/mnt/blobstorage/Small")