
I'm looking for a metadata table that holds all column names, table names, and creation timestamps within Spark SQL and Delta Lake. I need to be able to search by a given column name and list all the tables having that column.

Mauryas
  • What have you coded so far? What error are you getting? – Gaurang Shah Sep 08 '19 at 15:20
  • I do not know of a metadata table that can provide this, and that is exactly what I am asking for. There is no reason for a -1 unless you believe this has already been answered somewhere on the forum. – Mauryas Sep 08 '19 at 18:56

1 Answer


This doesn't exist in baseline Spark. For this you would need to create an internal ABaC process that gathers the metadata on process runs. For last update time you can parse the timestamp of an object in Hadoop from the output of a 'hadoop fs -ls' command; for column names you would need a process that recursively runs 'hive -e' with 'show create table' and parses out the header and footer; and to get all table names, use the same tactic but run 'show tables'. If you have a robust YARN server running all the code you can get the start and end time of jobs, but it is generally a nightmare to work with.
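For reference, here is a minimal sketch of such an inventory process using the PySpark catalog API instead of shelling out to hive. It assumes an active SparkSession with access to the metastore; the column name 'customer_id' is only a placeholder:

    # Build a column-level inventory from the catalog, then filter it by column name.
    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    rows = []
    for db in spark.catalog.listDatabases():
        for tbl in spark.catalog.listTables(db.name):
            # listColumns hits the metastore once per table, so this is slow on large catalogs
            for col in spark.catalog.listColumns(tbl.name, db.name):
                rows.append(Row(database=db.name, table=tbl.name,
                                column=col.name, dtype=col.dataType))

    inventory = spark.createDataFrame(rows)

    # All tables that contain a given column name (placeholder value)
    inventory.filter("column = 'customer_id'").select("database", "table").show()

For Delta tables, 'DESCRIBE DETAIL <table>' additionally returns creation and last-modified timestamps, which covers the timestamp part of the question without parsing 'hadoop fs -ls' output.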

afeldman
  • Thanks. After some research I have created my own process which recursively runs across all databases (show databases), then gathers each table's details (show detail tb), and finally all the columns under it (show table). It ran exhaustively, but it did the task. I need to find a way to maintain it and grow it incrementally rather than rebuilding exhaustively every time, as our environment is growing very fast. – Mauryas Sep 11 '19 at 19:42
  • Save these outputs into a table. At run time, do a join between the tables already processed and the newly listed tables. It adds a little overhead for the join, but in the end you only collect column names for the tables that are new. – afeldman Sep 11 '19 at 20:02
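A rough sketch of that incremental refresh, assuming the inventory from the sketch above is persisted as a table (the name 'metadata_inventory' is an assumption) and using a driver-side lookup in place of the join:

    # Only scan columns for tables not already present in the persisted inventory.
    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # (database, table) pairs captured on previous runs
    known = {(r.database, r.table)
             for r in spark.table("metadata_inventory")
                           .select("database", "table").distinct().collect()}

    new_rows = []
    for db in spark.catalog.listDatabases():
        for tbl in spark.catalog.listTables(db.name):
            if (db.name, tbl.name) in known:
                continue  # already inventoried, skip
            for col in spark.catalog.listColumns(tbl.name, db.name):
                new_rows.append(Row(database=db.name, table=tbl.name,
                                    column=col.name, dtype=col.dataType))

    if new_rows:
        spark.createDataFrame(new_rows).write.mode("append").saveAsTable("metadata_inventory")

Note that this sketch only appends new tables; dropped or altered tables would still need a separate cleanup pass.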