
I have multiple parquet files (around 1000). I need to load each one of them and save the result to a Delta table. I was thinking of doing it through a for loop in PySpark, but I am not finding any leads on how to do it. I am using Spark 3.0.

file name example = part-00000-tid-3509510096971042864-b633a465-b62f-45b5-a5c9-61af55de541a-6264-1-c000.snappy.parquet

Similarly, I have 1000 such partition files. Any help would be appreciated.

  • why do it that way? chronological issues? – thebluephantom Jun 13 '21 at 17:26
  • does [this](https://stackoverflow.com/questions/57983027/how-to-convert-parquet-to-spark-delta-lake) answer your question? – VectorXY Jun 13 '21 at 23:35
  • @thebluephantom I am trying to read all 1000 files into one data frame and then write it to a table in Spark, but it fails with Java heap space memory issues. Since the data set is large, I was thinking of loading one file at a time in a for loop so that I can load the entire dataset, but I am not able to figure out how to loop through the files and load them one at a time. df_temp = spark.read.format("PARQUET").load("location of file") df_temp.write.format("delta").mode("OVERWRITE").option("overwriteSchema", "true").save("table") – SHIBASHISH TRIPATHY Jun 14 '21 at 03:55
  • @VectorXY No, I am actually looking to load those files into a delta table, as mentioned in the comments above. – SHIBASHISH TRIPATHY Jun 14 '21 at 03:57
  • time travel how then? – thebluephantom Jun 14 '21 at 05:48
  • surely order of processing is important? – thebluephantom Jun 14 '21 at 05:52
  • @thebluephantom The order of processing is not important, but duplication should be avoided: once the first parquet file has been loaded, it should not be processed again. And what do you mean by time travel? I think that is a concept used for Delta Lake, but I am not sure. I already have 1000 parquet files in ADLS and need to load them into a table. – SHIBASHISH TRIPATHY Jun 14 '21 at 06:17
  • what do you mean by delta? not delta lake? you can read but you cannot write? – thebluephantom Jun 14 '21 at 06:28
  • if you can read, then you store a file marker and loop – thebluephantom Jun 14 '21 at 06:29
  • i will look tonight. it can be done. – thebluephantom Jun 14 '21 at 06:30
  • Thanks @thebluephantom. By Delta I mean I have created a table with Delta as the format, not a Delta lake. I am able to read the data into a dataframe, but when I try to write it, it takes an eternity and the load fails, as we have around 300 TB of data. I wanted to build an automated process that picks one file at a time, loads it into the table, and makes sure a file that has already been loaded is not processed again. – SHIBASHISH TRIPATHY Jun 14 '21 at 07:17
  • But delta is for Delta Lake, or am I mistaken? – thebluephantom Jun 14 '21 at 07:20
  • Delta Lake is one thing, and creating a table in delta format is another. The table just uses the delta capabilities like update, merge, and delta logs, but otherwise it is just like any other transactional table. I want to load all the files one by one into that table; that's the use case. – SHIBASHISH TRIPATHY Jun 14 '21 at 08:15
  • Mmm interesting – thebluephantom Jun 14 '21 at 20:42
  • One clarification: when we create a delta table it is stored as a Delta lake on top of my ADLS, so delta lake and delta format are the same; my mistake. I want to loop over each parquet file and write it into the Delta lake table. Can you help? – SHIBASHISH TRIPATHY Jun 15 '21 at 03:07
  • @thebluephantom I tried out a few things but no luck. Did you find anything? – SHIBASHISH TRIPATHY Jun 19 '21 at 12:47
  • I forgot. I will look later – thebluephantom Jun 19 '21 at 13:36
  • I found all things. – thebluephantom Jun 20 '21 at 14:45

1 Answer


OK, here it is. It is in Scala (I saw the pyspark tag too late), but you should be able to convert it easily to your requirements (and to pyspark).

Doing things sequentially is not a Big Data solution, imho. You could process sub-sets if the order of the data is not important, but I think it is, so that Delta Lake time travel reflects it.

There are no serialization issues when using a for loop from the driver. I just re-checked the looping on Databricks Community Edition, which would surface most serialization issues if there were any.

You will need to tailor the file listings, as these examples use Databricks' dbutils.

Code & adapt accordingly:

// List the input files (Databricks dbutils); tailor the path and the filter to your environment.
val allFiles = dbutils.fs.ls("dbfs:/FileStore/tables/").map(_.path).filter(name => name.contains("x1")).toList

for (file <- allFiles) {
     println("File : " + file)
     val df = spark.read.text(file)
     df.show(false)
     // Append each file's contents to the same table.
     df.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).saveAsTable("SOx")
}

val df2 = spark.table("SOx")
df2.show(false)

Returns:

File : dbfs:/FileStore/tables/x111.txt
+-----+
|value|
+-----+
|1    |
|2    |
|2    |
|2    |
|2    |
|2    |
+-----+

File : dbfs:/FileStore/tables/x112.txt
+-----+
|value|
+-----+
|1    |
|2    |
|2    |
|2    |
|3    |
|33   |
|33   |
+-----+

File : dbfs:/FileStore/tables/x113.txt
+-----+
|value|
+-----+
|777  |
|777  |
|666  |
+-----+


+-----+
|value|
+-----+
|1    |
|2    |
|2    |
|2    |
|3    |
|33   |
|33   |
|777  |
|777  |
|666  |
|1    |
|2    |
|2    |
|2    |
|2    |
|2    |
+-----+
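
Since your target is a Delta table rather than a parquet table, the same loop only needs the write format changed. Below is a minimal pyspark sketch, assuming Delta Lake is available on the cluster (it is on Databricks); the landing path and the table name SOxDelta are placeholders to adjust to your ADLS location:

%python
# Placeholder path - replace with your ADLS / DBFS location holding the 1000 part files.
all_files = [f.path for f in dbutils.fs.ls("dbfs:/FileStore/tables/landing/")
             if f.path.endswith(".snappy.parquet")]

for file in all_files:
  print("File : " + file)
  df = spark.read.parquet(file)
  # Append into a Delta table; only the format differs from the parquet example above.
  df.write.format("delta").mode("append").saveAsTable("SOxDelta")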

Of course, a parquet file can have N parts. It is unclear what you mean in this regard, but we cannot process the individual partition files of a parquet file.

The following approach does work, where in this case I have saved 2 tables with parquet-format files. Note that I look up the path to each table and get all the partition files for that parquet table. I am not sure if this is what you want.

// List the table directories under the warehouse (Databricks dbutils); tailor to your environment.
val allPaths = dbutils.fs.ls("/user/hive/warehouse").map(_.path).filter(name => name.contains("/sox")).toList

for (file <- allPaths) {
     println("File : " + file)
     val df = spark.read.parquet(file)
     df.show(false)
     df.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).saveAsTable("SOx3")
}

I cannot get the individual-file aspect to work. I have never tried to individually process the parts of a file, and I suspect / expect we should not do this; that makes sense to me. So it may not be possible to do exactly what you want, unless you adopt the partition-level approach below.

pyspark approach:

%python
# Databricks listing; tailor the path and the "/sox" filter to your environment.
allPaths = dbutils.fs.ls("/user/hive/warehouse")
allPathsName = [x.path for x in allPaths]
allPathsFiltered = [s for s in allPathsName if "/sox" in s]

for file in allPathsFiltered:
  print(file)
  df = spark.read.parquet(file)
  df.show()
  df.write.mode("append").format("parquet").saveAsTable("SOx3")

A per-partition approach is possible. Here it is in Scala; you can convert it to python / pyspark (a rough conversion sketch follows the Scala block):

val allPaths=dbutils.fs.ls("/user/hive/warehouse/").map(_.path).filter( name => name.contains("sox") ).toList

for( file <- allPaths) {
     println("File : "+ file)
     val allParts=dbutils.fs.ls(file).map(_.path).filter( name => name.contains("firstname=") ).toList
     println("Part : "+ allParts)
     for( part <- allParts) {
          val df = spark.read.parquet(part)
          df.show(false)
          df.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).saveAsTable("SOx4")
     }  
}
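
For reference, a rough pyspark conversion of the same per-partition loop might look like the following (an untested sketch, using the same Databricks dbutils listing and the same firstname= partitioning as the Scala version):

%python
all_paths = [p.path for p in dbutils.fs.ls("/user/hive/warehouse/") if "sox" in p.path]

for file in all_paths:
  print("File : " + file)
  # List the partition sub-directories of this table (partitioned by firstname here).
  all_parts = [p.path for p in dbutils.fs.ls(file) if "firstname=" in p.path]
  print("Parts : " + str(all_parts))
  for part in all_parts:
    df = spark.read.parquet(part)
    df.show()
    df.write.format("parquet").mode("append").saveAsTable("SOx4")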

As a final comment I would suggest a bigger cluster.
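
On the point from the comments about not reprocessing files that have already been loaded: one simple option (a sketch, not a battle-tested implementation) is to keep a small bookkeeping Delta table of processed paths and skip anything already recorded. The table name processed_files and the variable all_files from the earlier sketch are assumptions:

%python
# Load the set of already-processed paths; on the first run the table will not exist yet.
try:
  processed = {r.path for r in spark.table("processed_files").collect()}
except Exception:
  processed = set()

for file in all_files:
  if file in processed:
    continue  # already loaded in a previous run
  df = spark.read.parquet(file)
  df.write.format("delta").mode("append").saveAsTable("SOxDelta")
  # Record the file so the next run skips it.
  spark.createDataFrame([(file,)], ["path"]) \
       .write.format("delta").mode("append").saveAsTable("processed_files")

On Databricks, Auto Loader or COPY INTO can do this bookkeeping for you, but the manual approach above also works with plain Spark plus the Delta Lake library.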

thebluephantom
  • @thebluephantom Thank you so much, you gave me the vision for approaching the problem. I did try getting a bigger cluster and it solved my issue. – SHIBASHISH TRIPATHY Jun 21 '21 at 17:52