
I have data stored in parquet files and a Hive table partitioned by year, month and day. Thus, each parquet file is stored in a /table_name/year/month/day/ folder.

I want to read in data for only some of the partitions. I have a list of paths to the individual partitions as follows:

paths_to_files = ['hdfs://data/table_name/2018/10/29',
                  'hdfs://data/table_name/2018/10/30']

And then I try to do something like:

df = sqlContext.read.format("parquet").load(paths_to_files)

However, the resulting data does not include the year, month and day columns, as this information is not part of the data per se; rather, it is stored in the path to each file.

I could use the SQL context to send a Hive query with a SELECT statement and a WHERE clause on the year, month and day columns to select only data from the partitions I am interested in. However, I'd rather avoid constructing a SQL query in Python, as I am very lazy and don't like reading SQL.

I have two questions:

  1. What is the optimal way (performance-wise) to read in data stored as parquet, where the information about year, month and day is not present in the parquet files but only in the path to each file? (Either send a Hive query using sqlContext.sql('...'), use read.parquet, ... anything really.)
  2. Can I somehow extract the partitioning columns when using the approach I outlined above?
mayank agrawal
ira
  • Hi, did you have a look at adding an additional column with the filename via the function `input_file_name()` for each file, with something like https://stackoverflow.com/questions/39868263/spark-load-data-and-add-filename-as-dataframe-column ? Then your path would be included. You can even strip out parts of the path, e.g. month and day, with a regular expression (see the sketch after these comments). – gaw Nov 28 '18 at 13:36
  • Yes, I tried that, but parsing out the year, month and day seemed a bit slow – ira Nov 28 '18 at 14:14
  • I think you can add the basePath option `sqlContext.read.option("basePath", "hdfs://data/table_name").format("parquet").load(paths_to_files)` and then you will get the columns you want. This is what worked for me in a slightly different usage. – Josh Herzberg Mar 05 '21 at 19:50
  • Thank you @JoshHerzberg. Adding option("basePath",..) worked for me :) – Vasanth Subramanian Jan 16 '22 at 17:29
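
For reference, a minimal sketch of the `input_file_name()` approach from the comments, assuming a SparkSession named `spark` and the /table_name/year/month/day/ layout from the question (the regular expression is an assumption about that path format):

from pyspark.sql.functions import input_file_name, regexp_extract

# Read only the partitions of interest and keep the source path of each row.
df = spark.read.parquet(*paths_to_files).withColumn("path", input_file_name())

# Pull year/month/day back out of the path, e.g. .../table_name/2018/10/29/...
date_pattern = r"/(\d{4})/(\d{2})/(\d{2})/"
df = (df
      .withColumn("year",  regexp_extract("path", date_pattern, 1).cast("int"))
      .withColumn("month", regexp_extract("path", date_pattern, 2).cast("int"))
      .withColumn("day",   regexp_extract("path", date_pattern, 3).cast("int"))
      .drop("path"))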

2 Answers

3

Reading the direct file paths to the parent directory of the year partitions should be enough for a dataframe to determine there are partitions under it. However, it wouldn't know what to name the partitions without a directory structure such as /year=2018/month=10, for example.
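
As a rough sketch (the paths are assumptions based on the question), with a key=value layout Spark's partition discovery names the columns for you, and the basePath option from the comments lets you load only some partition directories while keeping those columns:

# With directories like hdfs://data/table_name/year=2018/month=10/day=29/,
# reading the table root discovers year, month and day as columns.
df = spark.read.parquet("hdfs://data/table_name")

# Reading a subset of partition directories keeps the columns
# as long as basePath points at the table root.
df_subset = (spark.read
             .option("basePath", "hdfs://data/table_name")
             .parquet("hdfs://data/table_name/year=2018/month=10/day=29",
                      "hdfs://data/table_name/year=2018/month=10/day=30"))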

Therefore, if you have Hive, going via the metastore would be better: the partitions are named there, Hive stores extra useful information about your table, and you're not reliant on knowing the direct path to the files on disk from the Spark code.

Not sure why you think you need to read/write SQL, though.

Use the DataFrame API instead, e.g.

df = spark.table("table_name")
df_2018 = df.filter(df['year'] == 2018)
df_2018.show() 
OneCricketeer
  • Unfortunately, when I try to read the path to the parent directory of the year partition, I get an `unable to determine the parquet schema` error. Haven't been able to find a fix for that yet. Thank you very much for the suggestion of handling it without writing a SQL query. – ira Nov 30 '18 at 09:59
  • Could you please provide some details on why exactly going via the metastore would be better? – ira Nov 30 '18 at 10:00
  • The Hive metastore knows where the files exist and automatically determines the columns and metadata of the parquet files – OneCricketeer Nov 30 '18 at 10:04
  • Thank you very much! Is it also possible to use the dataframe API to write into a table? (I have an external Hive table, stored as parquet, partitioned by one column) – ira Nov 30 '18 at 10:13
  • `df.write.saveAsTable`, I believe (see the sketch below) – OneCricketeer Nov 30 '18 at 10:42
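
Following up on that comment: a hedged sketch of writing with the DataFrame API (the table and column names are placeholders; whether to use saveAsTable or insertInto depends on how your external table was created):

# Create (or overwrite) a managed parquet table partitioned by one column.
(df.write
   .mode("overwrite")
   .partitionBy("year")
   .format("parquet")
   .saveAsTable("db.table_name"))

# For an existing partitioned table, insert into it instead;
# insertInto matches columns by position, with partition columns last.
df.write.insertInto("db.table_name", overwrite=False)
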
-1

Your data isn't stored in a way that is optimal for parquet, so you'd have to load the files one by one and add the dates.
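
A minimal sketch of that, assuming the paths_to_files list from the question and a SparkSession named `spark`:

from functools import reduce
from pyspark.sql.functions import lit

parts = []
for path in paths_to_files:
    # e.g. 'hdfs://data/table_name/2018/10/29' -> 2018, 10, 29
    year, month, day = (int(p) for p in path.rstrip("/").split("/")[-3:])
    parts.append(spark.read.parquet(path)
                      .withColumn("year", lit(year))
                      .withColumn("month", lit(month))
                      .withColumn("day", lit(day)))

# Combine the per-partition dataframes into one.
df = reduce(lambda a, b: a.union(b), parts)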

Alternatively, you can move the files into a directory structure that parquet partition discovery understands (e.g. .../table/year=2018/month=10/day=29/file.parquet). Then you can read the parent directory (table) and filter on year, month and day; Spark will only read the relevant directories, and you'd also get these as columns in your dataframe.
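
A sketch of reading the relaid-out table (the path is an assumption based on the example above); filtering on the partition columns lets Spark prune the directories it scans:

# year, month and day are discovered from the directory names.
df = spark.read.parquet("hdfs://data/table")

# Only the matching year=2018/month=10 directories are read.
subset = df.filter((df.year == 2018) & (df.month == 10) & (df.day.isin(29, 30)))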

Arnon Rotem-Gal-Oz
  • 25,469
  • 3
  • 45
  • 68