We have a source of files, each ranging from a few MB to a few GB in size. Each file is uniquely named and can be mapped to a person. However, the person information comes from several different sources and is not stored in the file system.
We now have a requirement to move all the files to HDFS and to build a UI for attaching person information to a file and, later, searching for files by that person information.
I am thinking of moving the files every night using WebHDFS (so that we can secure the cluster with Knox), and of building a UI that adds the person information to HBase and links each person to the appropriate file (a user would map a file name to a person). Each HBase record would hold the person information and the path of the HDFS file.
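To make the plan concrete, here is a rough Python sketch of the two pieces: the WebHDFS `CREATE` URL as it would look behind a Knox gateway, and the shape of one HBase record. It only models the data and talks to no cluster. The gateway host, the `default` topology name, the column family `p`, the person field names, and the choice of the file name as row key are all assumptions made for illustration, not details of our actual setup.

```python
from urllib.parse import quote, urlencode

# Hypothetical values for illustration only.
KNOX_BASE = "https://knox.example.com:8443/gateway/default"
COLUMN_FAMILY = "p"


def webhdfs_create_url(hdfs_path, overwrite=False):
    """Build the WebHDFS CREATE URL for hdfs_path as exposed through Knox."""
    query = urlencode({"op": "CREATE", "overwrite": str(overwrite).lower()})
    # quote() keeps "/" intact and percent-encodes spaces and similar characters.
    return f"{KNOX_BASE}/webhdfs/v1{quote(hdfs_path)}?{query}"


def person_file_row(file_name, hdfs_path, person):
    """Model one HBase record as (row_key, cells).

    The row key is the unique file name (an assumption); the cells hold
    the person fields plus the HDFS path of the file.
    """
    cells = {f"{COLUMN_FAMILY}:{field}": value for field, value in person.items()}
    cells[f"{COLUMN_FAMILY}:hdfs_path"] = hdfs_path
    return file_name, cells


def find_paths_by_person(rows, field, value):
    """Return HDFS paths of all rows whose person field equals value.

    This is a plain in-memory scan over the modeled rows, standing in for
    whatever lookup the search UI would run against HBase.
    """
    column = f"{COLUMN_FAMILY}:{field}"
    return [
        cells[f"{COLUMN_FAMILY}:hdfs_path"]
        for _, cells in rows
        if cells.get(column) == value
    ]


if __name__ == "__main__":
    path = "/data/files/scan 001.bin"
    row = person_file_row("scan 001.bin", path, {"name": "Jane Doe", "dept": "R&D"})
    print(webhdfs_create_url(path))
    print(row)
    print(find_paths_by_person([row], "name", "Jane Doe"))
```

With the file name as row key, looking up the person for a known file is a single get, while searching by a person field as modeled here touches every row, which is the part I am least sure about.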
Does this architecture have any drawbacks? In particular, is it okay to store HDFS file paths in HBase records?