
I'd like to share my requirement and ask how best it can be solved.

I have an SQL query, say "SQL_QUERY_RUNS_AND_GIVES_RESULT_SET", which runs and passes its result set to a dataframe. Since the result set is huge, I create several partitions out of it and save the CSV files in a folder called "/tmp/CSV_FILE_NAME.csv". Below the code is a sample list of the partitioned CSV files that get generated in that folder.

df_3 = sqlContext.sql("""
SQL_QUERY_RUNS_AND_GIVES_RESULT_SET
""")

rowsPerPartition = 10000
partitions = int(1 + df_3.count() / rowsPerPartition)

(df_3.repartition(numPartitions=partitions)
     .write.mode("overwrite")
     .format("com.databricks.spark.csv")
     .option("header", "true")
     .option("delimiter", ";")
     .save("/tmp/CSV_FILE_NAME.csv"))


File /dbfs/tmp/CSV_FILE_NAME.csv/part-00002-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-169-1-c000.csv
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00001-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-168-1-c000.csv
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00003-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-170-1-c000.csv
File /dbfs/tmp/CSV_FILE_NAME.csv/part-00000-tid-7732773626680141509-68331d20-9bec-403d-8e18-cf3c1009ad25-167-1-c000.csv

Now, my requirement is to create a table for each of these partitioned CSV files. Since I do not know in advance how many partitioned CSV files will result, the number of tables is dynamic. Also, the table names can be made simpler, like CSV_FILE_NAME_{part-0001}; that is, I'd like to pick just the first two tokens from each partitioned CSV file's name. And the tables must be recreated every time I execute this whole code snippet in the notebook.

The reason for doing this: while the partitioned files are being imported into another technology, some imports fail and not all rows get imported. So I want to focus my analysis on the partition that fails. If I create a single table out of the dataframe that the SQL query produces, identifying the root cause becomes tedious. Since I re-create the tables each time the code snippet runs, storage consumption is not an issue.

I think the command below helps in creating a table, but I was not sure exactly how the looping should be done for my requirement. I'd appreciate any mindshare on this.

# Create the table.
spark.sql("CREATE TABLE " + table_name + " USING DELTA LOCATION '" + save_path + "'") 
Alex Ott

1 Answer


Add a column to your dataset and set its value to the partition number. For example:

row    partition
data   1
data   1
data   2
...
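A minimal sketch of how such a column could be computed in PySpark, assuming a bucket size of 10,000 rows as in the question and an existing column "id" to order by (note that a global window like this routes all rows through a single task, so it may be slow on very large data):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# "id" is a placeholder ordering column; swap in whatever defines row order.
w = Window.orderBy("id")
df = df_3.withColumn(
    "partition",
    ((F.row_number().over(w) - 1) / 10000).cast("int")
)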

Then use partitionBy when writing the CSV:

df.write.partitionBy('partition').csv('test')

The result will be saved as follows:

.
└── test
    ├── partition=1
    │   └── part-00000-03446c22-782d-49cf-bbe0-7b702dd03309.c000.csv
    ├── partition=2
    │   └── part-00000-03446c22-782d-49cf-bbe0-7b702dd03309.c000.csv
    └── _SUCCESS

Then you can fix the path to each partition folder and create the tables with a for loop, once rather than every time.
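A sketch of that loop, assuming the output was written to the question's /tmp/CSV_FILE_NAME.csv path, that Databricks' dbutils is available, and that the header and delimiter options match how the CSVs were written:

# Sketch only: register one external table per partition=N folder.
for f in dbutils.fs.ls("/tmp/CSV_FILE_NAME.csv/"):
    if f.name.startswith("partition="):
        part_no = f.name.rstrip("/").split("=")[1]
        table_name = "CSV_FILE_NAME_part_" + part_no
        spark.sql("DROP TABLE IF EXISTS " + table_name)
        spark.sql(
            "CREATE TABLE " + table_name +
            " USING CSV OPTIONS (header 'true', delimiter ';')" +
            " LOCATION '/tmp/CSV_FILE_NAME.csv/" + f.name + "'"
        )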

Lamanus
  • Hi Lamanus, can you guide me on how to save into tables based on my initial requirement? I do not want to use the partitionBy approach at the moment, as the whole notebook is working fine. Thank you. – Noor Basha Shaik Nov 27 '21 at 09:30