OK, here it is, in Scala (I saw the exact requirements too late), but you should be able to convert it easily to your needs (and to pyspark).
Processing things sequentially is not really a Big Data solution, imho. You could process sub-sets if the order of the data did not matter, but I think it does matter here if Delta Lake is to reflect the ingestion order via time travel.
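To illustrate the time-travel point, here is a minimal Delta sketch (the table name SOx_delta and the dummy data are only for illustration, not from the question): each append becomes a new table version, so the history only reflects file order if the files are processed in order.

// Hypothetical sketch: appends to a Delta table create one version per append
val sample = spark.range(5).toDF("value")
sample.write.format("delta").mode("append").saveAsTable("SOx_delta")   // creates version 0
sample.write.format("delta").mode("append").saveAsTable("SOx_delta")   // version 1
// Time travel back to the state after the first append
spark.sql("SELECT * FROM SOx_delta VERSION AS OF 0").show(false)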
There are no serialization issues when using a for loop on the driver. I just re-checked the looping on Databricks Community Edition, which tends to surface serialization issues if they exist.
You will need to tailor the file listings, as these use Databricks dbutils.
Code & adapt accordingly:
// List the input files on DBFS, keeping only those whose name contains "x1"
val allFiles = dbutils.fs.ls("dbfs:/FileStore/tables/").map(_.path).filter(name => name.contains("x1")).toList

// Sequentially read each text file on the driver and append it to the same table
for (file <- allFiles) {
  println("File : " + file)
  val df = spark.read.text(file)
  df.show(false)
  df.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).saveAsTable("SOx")
}
val df2 = spark.table("SOx")
df2.show(false)
Returns:
File : dbfs:/FileStore/tables/x111.txt
+-----+
|value|
+-----+
|1 |
|2 |
|2 |
|2 |
|2 |
|2 |
+-----+
File : dbfs:/FileStore/tables/x112.txt
+-----+
|value|
+-----+
|1 |
|2 |
|2 |
|2 |
|3 |
|33 |
|33 |
+-----+
File : dbfs:/FileStore/tables/x113.txt
+-----+
|value|
+-----+
|777 |
|777 |
|666 |
+-----+
+-----+
|value|
+-----+
|1 |
|2 |
|2 |
|2 |
|3 |
|33 |
|33 |
|777 |
|777 |
|666 |
|1 |
|2 |
|2 |
|2 |
|2 |
|2 |
+-----+
Of course, a parquet table can consist of N part files. It is unclear what you mean in this regard, but we cannot sensibly process the individual part files of a parquet table one by one.
The following approach does work: in this case I saved 2 tables backed by parquet files, then look up the path to each table and read all of its parquet part files in one go. I am not sure if this is what you want.
// Find the warehouse directories of the tables whose name contains "sox"
val allPaths = dbutils.fs.ls("/user/hive/warehouse").map(_.path).filter(name => name.contains("/sox")).toList

// Read each table directory as a whole (all of its parquet part files) and append it
for (file <- allPaths) {
  println("File : " + file)
  val df = spark.read.parquet(file)
  df.show(false)
  df.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).saveAsTable("SOx3")
}
I cannot get the individual-file aspect to work. I have never tried to process the part files of a table individually, and I suspect we should not do this; that makes sense to me. So it may not be possible to do exactly what you want, unless you work at the partition level, as shown further below.
pyspark approach:
%python
# Same approach in pyspark: find the table directories and append each one to SOx3
allPaths = dbutils.fs.ls("/user/hive/warehouse")
allPathsName = map(lambda x: x[0], allPaths)
allPathsFiltered = [s for s in allPathsName if "/sox" in s]
for file in allPathsFiltered:
    print(file)
    df = spark.read.parquet(file)
    df.show()
    df.write.mode("append").format("parquet").saveAsTable("SOx3")
A per-partition approach is possible - here it is in Scala; a pyspark conversion follows below:
// Find the table directories whose name contains "sox"
val allPaths = dbutils.fs.ls("/user/hive/warehouse/").map(_.path).filter(name => name.contains("sox")).toList

for (file <- allPaths) {
  println("File : " + file)
  // List the partition directories (e.g. firstname=...) under each table path
  val allParts = dbutils.fs.ls(file).map(_.path).filter(name => name.contains("firstname=")).toList
  println("Part : " + allParts)
  // Read and append one partition directory at a time
  for (part <- allParts) {
    val df = spark.read.parquet(part)
    df.show(false)
    df.write.format("parquet").mode(org.apache.spark.sql.SaveMode.Append).saveAsTable("SOx4")
  }
}
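For completeness, a rough pyspark sketch of the same per-partition loop (untested as such; it assumes the same warehouse path and the firstname= partition layout as above):

%python
allPaths = [f.path for f in dbutils.fs.ls("/user/hive/warehouse/") if "sox" in f.path]
for path in allPaths:
    print("File : " + path)
    # partition directories like .../firstname=.../ under each table path
    allParts = [p.path for p in dbutils.fs.ls(path) if "firstname=" in p.path]
    for part in allParts:
        df = spark.read.parquet(part)
        df.show()
        df.write.mode("append").format("parquet").saveAsTable("SOx4")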
As a final comment I would suggest a bigger cluster.