
There are multiple directories as shown below. I'm trying to automatically read all the parquet files and check whether any of their column names contains the string "prodcolor". One complication is that not all of the directories contain parquet files, and there are several directories under hdfs://user/hive/warehouse/, such as one.db, two.db and three.db.

hdfs://user/hive/warehouse/one.db/table1/             --- these have _SUCCESS and .parquet files
hdfs://user/hive/warehouse/one.db/table2/
hdfs://user/hive/warehouse/one.db/some/somefile.txt   --- these do not
hdfs://user/hive/warehouse/two.db/table3/             --- this one has .parquet files as well

I know that once we can read a parquet file, we can get its column names like this:

df = spark.read.parquet("hdfs://user/hive/warehouse/one.db/table1/")
df.columns

But how can I automatically check the directories from a PySpark job without extra libraries? If there is a way to query the Hive metadata directly, that would be great too, as long as it doesn't require knowing the table names in advance or establishing a JDBC connection. Preferably this can be done in Python. Thanks so much for your help.
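
For context, roughly the shape of what I'm hoping for is the sketch below, if the Hadoop `FileSystem` API exposed through `spark._jvm` counts as "no extra libraries" (untested; the catch-all exception handling is just a placeholder for "this directory holds no parquet files"):

# Sketch only: walk the warehouse with the Hadoop FileSystem API reachable
# through Spark's JVM gateway, so no extra Python packages are needed.
hadoop = spark._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(spark._jsc.hadoopConfiguration())

def walk(path):
    # Yield every directory under `path`, depth first.
    for status in fs.listStatus(hadoop.Path(path)):
        if status.isDirectory():
            child = status.getPath().toString()
            yield child
            yield from walk(child)

matches = []
for directory in walk("hdfs://user/hive/warehouse/"):
    try:
        df = spark.read.parquet(directory)   # fails if the directory has no parquet files
        if any("prodcolor" in c for c in df.columns):
            matches.append(directory)
    except Exception:
        pass                                 # not a parquet directory, skip it

print(matches)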

user3735871
  • Have you checked `spark.catalog.listTables`, `spark.catalog.listColumns` and `spark.read.table("mydb.tableNameHere").printSchema()`? – ccheneson Aug 25 '21 at 06:55
  • What happens if you run `spark.read.parquet("hdfs://user/hive/warehouse/one.db/some/")`? Do you get an exception? Then you can catch it and ignore this folder. Or you can list the files in the folder and check whether they are _SUCCESS or .parquet files – Yaroslav Fyodorov Aug 25 '21 at 08:01
  • @YaroslavFyodorov sorry, I edited the question to explain it better. The thing is, there are other directories under `/warehouse`, like `/two.db`, so I do not know the paths in advance – user3735871 Aug 25 '21 at 09:57
  • This seems relevant: https://stackoverflow.com/questions/35750614/pyspark-get-list-of-files-directories-on-hdfs-path. Afaik there are no real shortcuts - you start from a folder, list everything below it (recursing into subfolders), and so on. You will have to use extra libraries to work with HDFS. In the worst case you can run a shell command from Python/Scala to list the HDFS folder contents, but its output is ugly and you will have to parse it – Yaroslav Fyodorov Aug 25 '21 at 10:03
  • @ccheneson thanks. I edited the question to explain this better. I tried `spark.catalog.listTables`, but we need to know the database names as well, right? – user3735871 Aug 25 '21 at 10:14
  • You have the list of databases with `spark.catalog.listDatabases` – ccheneson Aug 25 '21 at 11:27
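
Update: a quick sketch of the catalog route suggested in the comments, in case it helps someone (untested; I'm assuming `listDatabases`/`listTables`/`listColumns` behave as described in the PySpark docs, and the "prodcolor" match is my addition):

# Enumerate databases, then tables, then columns via spark.catalog,
# without knowing table names up front and without a JDBC connection.
hits = []
for db in spark.catalog.listDatabases():
    for tbl in spark.catalog.listTables(db.name):
        for col in spark.catalog.listColumns(tbl.name, db.name):
            if "prodcolor" in col.name:
                hits.append((db.name, tbl.name, col.name))

print(hits)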

0 Answers