As an example, consider that I have data on all the major sports events that have happened. The schema is given below:

EventName,Date,Month,Year,City

This data is physically structured in HDFS by year, date, and month.

Now I want to create virtual partitions on top of it based on some other column value, e.g. City. The data will still be stored physically in HDFS in the year/date/month structure only, but my metadata would keep track of the virtual partitions.

Can the Hive metastore do this for me?
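
Roughly what I have in mind, as a minimal sketch; the table name, location, and column types below are only illustrative, not my actual setup:

    -- Existing physical layout, exposed as an external Hive table
    -- partitioned on the directory structure that already exists:
    CREATE EXTERNAL TABLE sports_events (
        EventName STRING,
        City      STRING
    )
    PARTITIONED BY (Year INT, Month INT, `Date` INT)
    LOCATION '/data/sports_events';

    -- What I would like on top of this: a "virtual" partition on City,
    -- tracked only in metadata, with the files left exactly where they are.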

anmolp95

1 Answer

I don't think this will work. Partitioning in Hive means creating a different directory for each partition, and the metastore only holds the table's metadata; it does not control the actual data. Whenever we query a Hive table on a partitioned column, the query executes only against that partition's directory. With virtual partitioning that does not change the HDFS structure, the real data stays in one directory, so the query still has to run over the entire dataset and no optimisation actually happens.
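
To illustrate, here is a rough sketch using the hypothetical table from the question (directory names below are only illustrative):

    -- Each (year, month, date) partition is its own directory, for example:
    --   /data/sports_events/year=2018/month=4/date=18/
    -- Filtering on a real partition column lets Hive prune and read only
    -- the matching directories:
    SELECT * FROM sports_events WHERE Year = 2018 AND Month = 4;

    -- City is not a partition column, so there is no directory to prune;
    -- this has to scan the files of every partition:
    SELECT * FROM sports_events WHERE City = 'London';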

Suresh Kumar
  • I agree with your point, but isn't it possible that we keep metadata for all the files associated with each partition? Then, when we query on a virtual partition, the metastore would just provide the list of files to process, so we wouldn't have to search the entire structure ourselves; the metastore would keep it precomputed for us. – anmolp95 Apr 18 '18 at 20:16
  • In your approach, if we create virtual partitions for each and every partition directory, the metadata will grow, so the burden on the NameNode increases, which causes performance issues. And business requirements may change from time to time, so we can't create different metadata for every different requirement on the same dataset; it's very expensive. The same thing already happens with normal partitioning if you query on the partition column: the metadata records which column the table is partitioned by, so the query runs only on the list of files in that partition. – Suresh Kumar Apr 19 '18 at 11:06
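
A small sketch of the point in the last comment: in the metastore, every (real) partition maps to exactly one HDFS location, and that mapping is what pruning relies on. The partition spec and path below are only illustrative:

    -- Register an existing directory as a partition; the metastore stores
    -- this partition -> location mapping:
    ALTER TABLE sports_events
        ADD PARTITION (Year = 2018, Month = 4, `Date` = 18)
        LOCATION '/data/sports_events/2018/04/18';

    -- Lists the partitions the metastore knows about and can prune on:
    SHOW PARTITIONS sports_events;

    -- A City "partition" could only be registered the same way, i.e. by
    -- pointing at a directory that holds only that city's data, which is
    -- exactly the physical restructuring the question wants to avoid.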