
I'd like to share my requirement and ask how best it can be solved.

I have an SQL query, say "SQL_QUERY_RUNS_AND_GIVES_RESULT_SET", which runs and passes its result set to a dataframe. Since the result set is huge, I create several partitions out of it and save the CSV files in a folder called "/tmp/CSV_FILE_NAME.csv". Below the code is a sample list of the partitioned CSV files that get generated in that folder.

df_3 = sqlContext.sql("""
SQL_QUERY_RUNS_AND_GIVES_RESULT_SET
""")

rowsPerPartition = 10000
partitions = int(1 + df_3.count() / rowsPerPartition)

(df_3.repartition(numPartitions=partitions)
     .write.mode("overwrite")
     .format("com.databricks.spark.csv")
     .option("header", "true")
     .option("delimiter", ";")
     .save("/tmp/CSV_FILE_NAME.csv"))


File /dbfs/tmp/CSV_FILE_NAME.csv/part-00002-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-169-1-c000.csv
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00001-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-168-1-c000.csv
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00003-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-170-1-c000.csv
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00000-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-167-1-c000.csv

Now, my requirement is to create a table for each of these partitioned CSV files. Since I do not know in advance how many partitioned CSV files will result, the number of tables is dynamic. Also, the table names can be made simpler, like CSV_FILE_NAME_{part-0001}; that is, I'd like to pick just the first two tokens from each partitioned CSV file's name. And the tables must be recreated every time I execute this whole code snippet in the notebook.

The reason for doing this: while the partitioned files are being imported into another technology, some imports fail and not all rows get imported. So I want to focus my analysis on the partition that fails. If I create a single table out of the dataframe that the SQL query produces, identifying the root cause becomes tedious. Since I re-create the tables each time the code snippet runs, storage consumption is not an issue.

I think the command below helps in creating a table, but I was not sure exactly how the looping should be done for my requirement. I'd appreciate any mindshare on this.

# Create the table.
spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'") 
Alex Ott

1 Answer


Add a column to your dataset and set its value to the partition number. For example:

row    partition
data   1
data   1
data   2
...
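A minimal sketch of how such a column could be computed in PySpark, assuming a bucket size of 10,000 rows as in the question and an existing column "id" to order by (note that a global window like this routes all rows through a single task, so it may be slow on very large data):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# "id" is a placeholder ordering column; swap in whatever defines row order.
w = Window.orderBy("id")
df = df_3.withColumn(
    "partition",
    ((F.row_number().over(w) - 1) / 10000).cast("int")
)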

Then use partitionBy when writing the CSV:

df.write.partitionBy('partition').csv('test')

The result will be saved as follows:

.
└── test
    ├── partition=1
    │   └── part-00000-03446c22-782d-49cf-bbe0-7b702dd03309.c000.csv
    ├── partition=2
    │   └── part-00000-03446c22-782d-49cf-bbe0-7b702dd03309.c000.csv
    └── _SUCCESS

Then you can fix the path to each partition folder and create the tables with a for loop, once rather than every time.
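A sketch of that loop, assuming the output was written to the question's /tmp/CSV_FILE_NAME.csv path, that Databricks' dbutils is available, and that the header and delimiter options match how the CSVs were written:

# Sketch only: register one external table per partition=N folder.
for f in dbutils.fs.ls("/tmp/CSV_FILE_NAME.csv/"):
    if f.name.startswith("partition="):
        part_no = f.name.rstrip("/").split("=")[1]
        table_name = "CSV_FILE_NAME_part_" + part_no
        spark.sql("DROP TABLE IF EXISTS " + table_name)
        spark.sql(
            "CREATE TABLE " + table_name +
            " USING CSV OPTIONS (header 'true', delimiter ';')" +
            " LOCATION '/tmp/CSV_FILE_NAME.csv/" + f.name + "'"
        )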

Lamanus
  • Hi Lamanus, can you guide me on how to save into tables based on my initial requirement? I do not want to use the partitionBy approach at the moment, as the whole notebook is working fine. Thank you. – Noor Basha Shaik Nov 27 '21 at 09:30