There are multiple directories as shown below. I'm trying to automatically read all the parquet files and check whether any of their column names contains the string "prodcolor". Note that not all of the directories contain parquet files, and there are multiple directories under hdfs://user/hive/warehouse/, like one.db, two.db and three.db.
hdfs://user/hive/warehouse/one.db/table1/            --- these have _SUCCESS and .parquet files
hdfs://user/hive/warehouse/one.db/table2/
hdfs://user/hive/warehouse/one.db/some/somefile.txt  --- these do not
hdfs://user/hive/warehouse/two.db/table3/            --- these have .parquet files as well
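To make the question concrete, this is roughly the kind of directory walk I have in mind, using only the Hadoop FileSystem API that Spark already exposes through its JVM gateway (untested sketch; the warehouse root and the "prodcolor" check are just my example, and I'm assuming that using _jvm/_jsc still counts as "no extra libraries"):

# Untested sketch: walk hdfs://user/hive/warehouse/ with the Hadoop FileSystem
# API that ships with Spark, so no extra Python libraries are needed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

jvm = spark.sparkContext._jvm
Path = jvm.org.apache.hadoop.fs.Path
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

warehouse = "hdfs://user/hive/warehouse/"
matches = []

for db_status in fs.listStatus(Path(warehouse)):             # one.db, two.db, ...
    if not db_status.isDirectory():
        continue
    for tbl_status in fs.listStatus(db_status.getPath()):    # table1, table2, ...
        if not tbl_status.isDirectory():
            continue
        # only read directories that actually contain parquet files
        children = fs.listStatus(tbl_status.getPath())
        if not any(c.isFile() and c.getPath().getName().endswith(".parquet")
                   for c in children):
            continue
        tbl_dir = tbl_status.getPath().toString()
        cols = spark.read.parquet(tbl_dir).columns
        if any("prodcolor" in c for c in cols):
            matches.append(tbl_dir)

print(matches)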
I know that once we read a parquet file, we can get the column names like this:

df = spark.read.parquet("hdfs://user/hive/warehouse/one.db/table1/")
df.columns
But how can I automatically check the directories from a PySpark job without any extra libraries? If there is a way to query the Hive metadata directly, that would be great too, without having to explicitly know the table names or establish a JDBC connection. Preferably this can be done in Python. Thanks so much for your help.
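For the metastore side of the question, this is the kind of thing I'm hoping is possible with the built-in spark.catalog API, i.e. no JDBC connection and no hard-coded table names (again an untested sketch, assuming Hive support is enabled on the session so the catalog sees the warehouse databases):

# Untested sketch: read column names from the Hive metastore via spark.catalog,
# without JDBC and without knowing the table names up front.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

hits = []
for db in spark.catalog.listDatabases():                  # one, two, three, ...
    for tbl in spark.catalog.listTables(db.name):
        for col in spark.catalog.listColumns(tbl.name, db.name):
            if "prodcolor" in col.name:
                hits.append((db.name, tbl.name, col.name))

for db_name, tbl_name, col_name in hits:
    print(db_name, tbl_name, col_name)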