
I have to write a Spark dataframe to a path of the format base_path/{year}/{month}/{day}/{hour}/. If I do something like below:

pc = ["year", "month", "day", "hour"]
df.write.partitionBy(*pc).parquet("base_path/", mode = 'append')

It creates the location as: base_path/year=2022/month=04/day=25/hour=10/. I do not want the column names (year, month, day, hour) to be part of the path; I want something like base_path/2022/04/25/10/ instead. Is there any solution for this?

seou1

1 Answer


The column names are written as part of the path because the partition values are not written in the objects themselves, so the column name is needed in the path in order to read the data back (following the Hive-style convention).
For more information about this see here.

If you still want to write the data with the above path layout, you can use multiple write commands, each with an explicit path, filtering the dataframe according to the partition values.
The current logic for determining the partition path is located here, and there doesn't seem to be a pluggable way to replace it (you could technically load a different implementation into the JVM or write your own writer implementation, but I would not recommend that).
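The per-partition write loop could be sketched as follows. The `partition_path` helper and its zero-padding are my assumptions (to match the `04`/`10` style in the question); the Spark loop requires an active `SparkSession` and a dataframe that actually has year/month/day/hour columns, so it is shown as a commented outline:

```python
def partition_path(base, year, month, day, hour):
    """Build a hive-free output path like base_path/2022/04/25/10/."""
    return f"{base.rstrip('/')}/{year}/{month:02d}/{day:02d}/{hour:02d}/"

# Hypothetical Spark loop (needs a running SparkSession and a df with
# the partition columns):
#
# for row in df.select("year", "month", "day", "hour").distinct().collect():
#     out = partition_path("base_path", row.year, row.month, row.day, row.hour)
#     (df.filter((df.year == row.year) & (df.month == row.month)
#                & (df.day == row.day) & (df.hour == row.hour))
#        .drop("year", "month", "day", "hour")   # values live in the path now
#        .write.parquet(out, mode="append"))
```

Note that dropping the partition columns before writing means you must reconstruct them from the path on read, as discussed above.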

Guy
  • Or let's say I want the filename or path to be taken from a column; how would I do that without using partitionBy? E.g. there is a column called 'path' with data such as 2022/04/25/10, and on .write, how do I specify saving to the location mentioned in the 'path' column? – seou1 May 01 '22 at 12:22
  • I am not sure if there is a nice way to do so. One option you have is to add the columns explicitly, something like `df = spark.read.parquet("s3://path/2022/04/25/10/"); df = df.withColumn("year", lit(2022))`. You can parse the path to make sure you add all relevant columns, and then if you have multiple paths you can use [union](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.union.html) to merge the dataframes – Guy May 01 '22 at 12:38
  • I am actually asking how to use the path column in the dataframe as the output location, i.e. to pass it as write.parquet(location_column) – seou1 May 03 '22 at 16:28