
The scenario is that I need to process an input file and, for each record, check whether certain fields in the input file match the fields stored in a Hadoop cluster.

We are considering using MRJob to process the input file and Hive to get the data from the Hadoop cluster. I would like to know whether it is possible to connect to Hive from inside an MRJob module. If so, how can I do that?

If not, what would be the ideal approach to fulfill my requirement?

I am new to Hadoop, MRJob and Hive.

Please provide some suggestions.

1 Answer


"matching the fields stored in a Hadoop cluster." --> Do you mean that you need to check whether the fields exist in that data too?

Roughly how many files are there in total that you need to scan?

One solution is to load every single item into an HBase table and, for every record in the input file, "GET" the record from the table. If the GET succeeds, the record exists elsewhere in HDFS; otherwise it doesn't. You would need a unique identifier for each HBase record, and the same identifier should exist in your input file as well.
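A minimal sketch of that GET-per-record check. In a real deployment you would use an HBase client (for example happybase's `table.row(key)`); here the table is simulated with a dict keyed by the unique identifier, so the matching logic runs on its own. The row-key scheme and the `addr` column family are illustrative assumptions, not something from the question.

```python
# Simulated HBase lookup: row key = the unique identifier (Name here),
# value = column-family:qualifier -> cell bytes, as an HBase GET would return.

def add1_matches(hbase_rows, name, add1):
    """GET the row by its key; report whether it exists and add1 agrees."""
    row = hbase_rows.get(name.encode("utf-8"))
    return row is not None and row.get(b"addr:add1") == add1.encode("utf-8")

# Stand-in for the HBase table (in production: happybase.Connection(...).table(...))
hbase_rows = {
    b"Mark": {b"addr:add1": b"31 Maybush", b"addr:postcode": b"WF1 5XY"},
}

for line in ["Mark,31 Maybush,XXX,WF1 5XY", "Jane,10 Elm Road,YYY,LS1 1AA"]:
    name, add1, add2, postcode = line.split(",")
    print(name, "match" if add1_matches(hbase_rows, name, add1) else "no match")
```

With a real HBase table, only the dict lookup changes; the per-record loop and comparison stay the same.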

You could connect to Hive as well, but the schema would need to be rigid so that all of your HDFS files can be loaded into a single Hive table. HBase doesn't really care about columns (only ColumnFamilies are needed). One more downside of MapReduce and Hive is that they will be slow compared to HBase, which is near real time.

Hope this helps.

kashmoney
  • Yes. For example, my input file is something like this: Name,add1,add2,postcode / Mark,31 Maybush,XXX,WF1 5XY. I need to check whether the value of the field 'add1', which is "31 Maybush" for Name "Mark" in the input file, matches the data in the cluster for the same name. – user1703319 Nov 29 '16 at 18:22
  • Will all the files have the exact same schema? If yes, then you can create a Hive table and load all the hundreds/thousands of files you have into the new Hive table. Then you can run a Hive query from MapReduce. Your Hive query will be something like this: SELECT * FROM huge_hive_table WHERE add1 = '31 Maybush' AND name = 'Mark'; Of course you will need to substitute '31 Maybush' and 'Mark' dynamically as each input-file line is read. – kashmoney Nov 29 '16 at 22:37
  • The problem with the above approach is that Hive launches an MR job for each query, so an MR job will run for every line in the input file, since we are comparing line by line. If you have 2000 lines, that's 2000 MR jobs just for the comparisons. – kashmoney Nov 29 '16 at 22:39
  • It is best to use the HBase approach; it will be near real time for this kind of lookup. – kashmoney Nov 29 '16 at 22:40
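A sketch of the per-line query construction the comments describe, assuming the `huge_hive_table` schema above. Only the query string is built here; actually executing it would need a Hive client (for example pyhive's `cursor.execute`), which is omitted. The quote-doubling is a minimal precaution for values like "O'Brien", not full input sanitization.

```python
# Build the per-record Hive query from one line of the input file.

def build_query(name, add1):
    esc = lambda s: s.replace("'", "''")  # double single quotes for HiveQL literals
    return ("SELECT * FROM huge_hive_table "
            "WHERE add1 = '{}' AND name = '{}'".format(esc(add1), esc(name)))

line = "Mark,31 Maybush,XXX,WF1 5XY"
name, add1, add2, postcode = line.split(",")
print(build_query(name, add1))
# SELECT * FROM huge_hive_table WHERE add1 = '31 Maybush' AND name = 'Mark'
```

Note that, as the comment above warns, running this once per input line means one MR job per line; this sketch only shows the substitution step, not a way around that cost.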