My use case is that I want to build a reporting tool over roughly 200 tables, each with millions of rows and hundreds of columns. Producing a report requires multiple joins between these tables. The user will select fields in the UI to build a report from, so the query will be generated at runtime. I want to understand which Big Data technology would best fit this purpose, since our current RDBMS may not scale to such a high volume of data. We can dump all the data onto HDFS, but how do we then implement the joins so that the performance of the reporting application doesn't suffer too much? Any real implementation, link, or paper describing a similar use case would help me a great deal.
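For illustration only, here is a minimal sketch of what "generated at runtime" could look like over JDBC: the user's selected fields are stitched into a join query and sent to whichever SQL-on-Hadoop engine is chosen. The Phoenix connection URL and the `orders`/`customers` tables and their columns are hypothetical, not part of any real schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

public class ReportQueryBuilder {
    public static void main(String[] args) throws Exception {
        // Columns the user picked in the UI (hypothetical names).
        List<String> selectedColumns =
                Arrays.asList("c.customer_name", "o.order_date", "o.total_amount");

        // Assemble the join at runtime; a real tool would derive the join graph
        // from metadata about which tables the selected columns live in.
        String sql = "SELECT " + String.join(", ", selectedColumns)
                + " FROM orders o"
                + " JOIN customers c ON c.customer_id = o.customer_id"
                + " WHERE o.total_amount > 100";

        // The connection URL is engine-specific; a Phoenix URL is shown as one example.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " | "
                        + rs.getString(2) + " | " + rs.getString(3));
            }
        }
    }
}
```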
- If the data is in HDFS, you can use the Phoenix Bulkloader to move it into HBase. The trick with Phoenix is that you want to use JOINs rather than separate SELECTs from multiple tables. I do recommend that you carefully construct and `EXPLAIN` your queries, and carefully analyze your covered indexes in Phoenix/HBase. – Paul Bastide Nov 01 '17 at 10:37
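To make that suggestion concrete, here is a minimal sketch assuming hypothetical Phoenix tables `ORDERS` and `CUSTOMERS`: a covered index is created with `INCLUDE`, and `EXPLAIN` is run on the join before putting it in front of users, so you can confirm the plan actually uses the index.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixExplainCheck {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement()) {

            // Covered index: keyed on CUSTOMER_ID, with the reported columns
            // included, so the ORDERS side of the join can be served from the
            // index alone instead of the full table.
            stmt.execute("CREATE INDEX IF NOT EXISTS IDX_ORDERS_CUST "
                    + "ON ORDERS (CUSTOMER_ID) INCLUDE (ORDER_DATE, TOTAL_AMOUNT)");

            // Inspect the plan for a join query before exposing it in a report.
            String explain = "EXPLAIN SELECT c.CUSTOMER_NAME, o.ORDER_DATE, o.TOTAL_AMOUNT "
                    + "FROM ORDERS o JOIN CUSTOMERS c ON c.CUSTOMER_ID = o.CUSTOMER_ID "
                    + "WHERE o.CUSTOMER_ID = 42";
            try (ResultSet rs = stmt.executeQuery(explain)) {
                while (rs.next()) {
                    System.out.println(rs.getString(1)); // one plan step per row
                }
            }
        }
    }
}
```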
- Thanks for the info, Paul. I was also evaluating Phoenix with HBase. To take this forward, do you suggest creating 200 different tables in HBase, or normalizing the data into a few tables? Currently we create 200 different tables, but when a query joins many of them the performance takes a hit. Sincere thanks again for the help. – Aditya Nov 01 '17 at 14:08
- Create very few tables; at least that's been my best experience. We currently have 200 tables loaded and use at most 7 tables in a join. – Paul Bastide Nov 01 '17 at 15:59
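As a sketch of the "very few tables" approach (table and column names are hypothetical): attributes that would otherwise sit in separate narrow tables are carried as columns of one wide Phoenix table at load time, so most reports only touch a handful of tables and joins.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class WideTableSetup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
             Statement stmt = conn.createStatement()) {
            // One wide table keyed by the reporting grain; customer attributes
            // that could live in their own table are stored alongside the order
            // columns, removing that join from every report query.
            stmt.execute("CREATE TABLE IF NOT EXISTS ORDER_REPORT ("
                    + " ORDER_ID BIGINT NOT NULL,"
                    + " CUSTOMER_ID BIGINT NOT NULL,"
                    + " CUSTOMER_NAME VARCHAR,"
                    + " REGION VARCHAR,"
                    + " ORDER_DATE DATE,"
                    + " TOTAL_AMOUNT DECIMAL(12,2),"
                    + " CONSTRAINT PK PRIMARY KEY (ORDER_ID, CUSTOMER_ID))");
        }
    }
}
```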