I'd like to share my requirement and ask how best it can be solved.
I have an SQL query, say "SQL_QUERY_RUNS_AND_GIVES_RESULT_SET", which runs and loads its result set into a DataFrame. Since the result set is huge, I split it into several partitions and save them as CSV files in a folder called "/tmp/CSV_FILE_NAME.csv". Below the code is a sample list of the partitioned CSV files that get generated in that folder.
df_3 = sqlContext.sql("""
SQL_QUERY_RUNS_AND_GIVES_RESULT_SET
""")
rowsPerPartition = 10000
partitions = int(1 + df_3.count() / rowsPerPartition)
df_3.repartition(numPartitions=partitions) \
    .write.mode("overwrite") \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("delimiter", ";") \
    .save("/tmp/CSV_FILE_NAME.csv")
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00002-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-169-1-c000.csv
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00001-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-168-1-c000.csv
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00003-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-170-1-c000.csv
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00000-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-167-1-c000.csv
Now, my requirement is to create a table for each of these partitioned CSV files. Since I do not know in advance how many partitioned CSV files will result, the number of tables is dynamic. Also, the table name can be made simpler, like CSV_FILE_NAME_{part-0001}; that is, I'd like to pick just the first two tokens from each partitioned CSV file name. And the tables must be recreated every time I execute this whole code snippet in the notebook.
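To illustrate what I mean by "first two tokens", here is a small pure-Python sketch; the helper name `table_name_for` is mine, and I swap the hyphen for an underscore since unquoted Spark table names cannot contain hyphens:

```python
import os

def table_name_for(path):
    # Hypothetical helper: derive a simpler table name from a
    # partition file path, e.g.
    #   /dbfs/tmp/CSV_FILE_NAME.csv/part-00002-tid-...-c000.csv
    #   -> CSV_FILE_NAME_part_00002
    base = os.path.basename(os.path.dirname(path)).rsplit(".", 1)[0]
    part = "-".join(os.path.basename(path).split("-")[:2])
    # Unquoted Spark table names cannot contain hyphens.
    return base + "_" + part.replace("-", "_")
```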
The reason for doing this: while the partitioned files are being imported into another technology, some of them fail, and not all rows get imported. So I want to focus my analysis on the partitions that fail. If I create a single table out of the DataFrame that the SQL query produces, it becomes tedious to identify the root cause. Since I am recreating the tables each time the code snippet runs, storage consumption is not an issue.
I think the command below helps in creating a table, but I am not sure exactly how to do the looping for my requirement. I would appreciate any input on this.
# Create the table.
spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'")
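To make the question concrete, here is a rough sketch of the looping I have in mind, assuming this runs in a Databricks notebook where `spark` and `dbutils` are predefined; the function name is mine, and since the underlying files are plain CSV I used `USING CSV` rather than `USING DELTA`:

```python
def recreate_partition_tables(spark, dbutils, folder="/tmp/CSV_FILE_NAME.csv"):
    """Drop and recreate one external table per partition CSV file in `folder`.

    Sketch only: assumes Databricks globals `spark` and `dbutils`,
    and that the table-name prefix CSV_FILE_NAME fits the use case.
    """
    for f in dbutils.fs.ls(folder):
        if not f.name.startswith("part-"):
            continue  # skip _SUCCESS and other marker files
        # e.g. 'part-00002' from 'part-00002-tid-...-c000.csv'
        part = "-".join(f.name.split("-")[:2])
        # Unquoted Spark table names cannot contain hyphens.
        table_name = "CSV_FILE_NAME_" + part.replace("-", "_")
        spark.sql("DROP TABLE IF EXISTS " + table_name)
        spark.sql(
            "CREATE TABLE " + table_name
            + " USING CSV OPTIONS (header 'true', delimiter ';')"
            + " LOCATION '" + f.path + "'"
        )
```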