I have a requirement where I need to JOIN a tweets table with person names, like filtering the tweets if it contains any person name. I have following data:
Tweets Table: (70 million records stored as a HIVE Table)
id | tweet |
---|---|
1 | Cristiano Ronaldo greatest of all time |
2 | Brad Pitt movies |
3 | Random tweet without any person name |
Person Names: (1.6 million names stored on HDFS as .tsv file)
id | person_name |
---|---|
1 | Cristiano Ronaldo |
2 | Brad Pitt |
3 | Angelina Jolie |
Expected Result:
id | tweet | person_name |
---|---|---|
1 | Cristiano Ronaldo greatest of all time | Cristiano Ronaldo |
2 | Brad Pitt movies | Brad Pitt |
What I've tried so far:
I have converted the person names .tsv file to HIVE table as well and then tried to join 2 tables with the following HIVE query:
SELECT * FROM tweets t INNER JOIN people p WHERE instr(t.tweet, p.person_name) > 0;
Tried with some sample data and it works fine. But when I try to run on entire data (70m tweets JOIN with 1.6m Person Names), it takes forever. Definitely doesn't look very efficient.
I wanted to try JOIN with PIG as well (as it is considered little more efficient than HIVE JOIN), where I can directly JOIN person names .tsv file tweets HIVE Table, but not sure how to JOIN based on substring in PIG.
Can someone please share the PIG JOIN syntax for this problem, if you have any idea? Also, please do suggest me any alternatives that I can use?