
We have a source of files, each ranging from a few MB to a few GB in size. Each file is uniquely named and can be mapped to a person. However, the person information comes from different sources and is not stored in the file system.

Now we have a requirement to move all the files to HDFS and build a UI to attach person information to each file, so that files can later be searched by person information.

I am thinking of moving the files every night using WebHDFS (so that we can secure the cluster with Knox) and building a UI that adds person information to HBase and links each person to the appropriate file (the user would map the file name to the person). Each HBase record would hold the person information and the path of the HDFS file.
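To make the idea concrete, here is a rough sketch of what I have in mind. It only builds the WebHDFS `CREATE` URL and the HBase row contents and does not talk to a cluster. The Knox host, the `default` topology name, the `/data/files` root, the `person` and `file` column families, and the row-key layout are all placeholders I made up for illustration, not a settled design.

```python
from urllib.parse import quote, urlencode

# All of these values are assumptions for illustration only.
KNOX_BASE = "https://knox.example.com:8443/gateway/default"  # hypothetical Knox gateway
HDFS_ROOT = "/data/files"                                     # hypothetical target directory


def hdfs_path(file_name: str) -> str:
    """Return the HDFS path a uniquely named source file would be stored under."""
    return f"{HDFS_ROOT}/{file_name}"


def webhdfs_create_url(path: str, overwrite: bool = False) -> str:
    """Build the WebHDFS REST URL for an HTTP PUT with op=CREATE, routed via Knox."""
    query = urlencode({"op": "CREATE", "overwrite": str(overwrite).lower()})
    return f"{KNOX_BASE}/webhdfs/v1{quote(path)}?{query}"


def person_row(person_id: str, file_name: str, person: dict):
    """Build a (row_key, columns) pair for one person-to-file link.

    The row key combines person id and file name so one person can own
    several files; the 'person' and 'file' column families are hypothetical.
    """
    row_key = f"{person_id}|{file_name}".encode()
    columns = {f"person:{k}".encode(): str(v).encode() for k, v in person.items()}
    # The HBase record stores only the HDFS path, never the file contents.
    columns[b"file:hdfs_path"] = hdfs_path(file_name).encode()
    return row_key, columns


if __name__ == "__main__":
    print(webhdfs_create_url(hdfs_path("scan_0001.bin")))
    print(person_row("p42", "scan_0001.bin", {"name": "Jane Doe", "source": "crm"}))
```

The nightly job would issue the PUT against the generated URL, and the UI would write the returned `(row_key, columns)` pair through whichever HBase client we end up using.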

I am wondering if the above architecture has any bad implications. Is it okay to have HDFS file paths in the HBase records?

user3600073
  • Are you sure you need HBase for that? Won't everything fit in a regular database (e.g. MySQL)? – facha Feb 02 '16 at 22:17
  • @facha Person data would differ depending on the source, so we considered Mongo first. However, we thought HBase could be helpful if we want to implement analytic use cases that require both files and person information. – user3600073 Feb 02 '16 at 22:29
  • If the number of people in the tables will not exceed a million, I think MongoDB is an easy way to search on different fields. You are describing typical JSON-format data. – halil Feb 03 '16 at 07:31
  • Whether HBase is a good choice or not depends on how you are going to use it. HBase performs well on "find a needle in a haystack" queries (lookups of a single row of data). HBase performs poorly on analytical queries (where you need to scan the whole dataset and aggregate it in some way). – facha Feb 03 '16 at 07:41
  • Thanks for the response. Is it usual to store HDFS file paths in databases that are used for OLTP? – user3600073 Feb 03 '16 at 12:42

0 Answers