
I have a requirement where I need to JOIN a tweets table with person names, i.e. keeping only the tweets that contain any person name. I have the following data:

Tweets Table: (70 million records stored as a Hive table)

id  tweet
1   Cristiano Ronaldo greatest of all time
2   Brad Pitt movies
3   Random tweet without any person name

Person Names: (1.6 million names stored on HDFS as a .tsv file)

id  person_name
1   Cristiano Ronaldo
2   Brad Pitt
3   Angelina Jolie

Expected Result:

id  tweet                                    person_name
1   Cristiano Ronaldo greatest of all time   Cristiano Ronaldo
2   Brad Pitt movies                         Brad Pitt

What I've tried so far:

I converted the person names .tsv file to a Hive table as well and then tried to join the two tables with the following Hive query:

SELECT * FROM tweets t INNER JOIN people p WHERE instr(t.tweet, p.person_name) > 0;

I tried it with some sample data and it works fine. But when I run it on the entire data set (70M tweets joined with 1.6M person names), it takes forever, and it definitely doesn't look very efficient.

I wanted to try the JOIN with Pig as well (as it is considered a little more efficient than a Hive JOIN), where I can directly JOIN the person names .tsv file with the tweets Hive table, but I'm not sure how to JOIN based on a substring in Pig.

Can someone please share the Pig JOIN syntax for this problem, if you have any idea? Also, please suggest any alternatives that I could use.

CM1
  • `I wanted to try JOIN with PIG as well (as it is considered little more efficient than HIVE JOIN),` this is almost certainly not true these days; Pig hasn't had a release in 4 years whereas Hive is still being actively developed. – Ben Watson Nov 18 '21 at 11:47
  • Thanks. That's right, but I'm curious to try the Pig script for the same. Do you have any idea what the syntax for this kind of JOIN would look like in Pig? – CM1 Nov 18 '21 at 23:51

2 Answers


It is worth trying a Map Join. The person table is a small one, and a join with it can be converted to a Map-Join operator if it fits into memory; the table will be loaded into each mapper's memory.

Check the EXPLAIN output. If it says that the Common Join operator is on the Reducer vertex, then try to increase the mapper container memory and adjust the map-join settings to convert it to a Map Join.
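
For example, a minimal sketch of checking the plan for the query from the question (the operator names in the comments are what Hive's plan output typically shows):

EXPLAIN
SELECT *
FROM tweets t
INNER JOIN people p
WHERE instr(t.tweet, p.person_name) > 0;

-- A "Map Join Operator" in the plan means the small table is broadcast
-- to the mappers; a "Join Operator" under a Reducer vertex means the
-- slow Common Join path is still being used.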

Settings responsible for Map Join (assuming the people table is < 2.5 GB): try to bump the map-join table size up to 2.5 GB (check the actual size) and run EXPLAIN again.

set hive.auto.convert.join=true; --this enables map-join
set hive.auto.convert.join.noconditionaltask=true;
set hive.mapjoin.smalltable.filesize=2500000000; --size of table to fit in memory
set hive.auto.convert.join.noconditionaltask.size=2500000000;

Also, the container size should be increased to avoid OOM (if you are on Tez):

set hive.tez.container.size=8192;  --container size in megabytes
set hive.tez.java.opts=-Xmx6144m;  --set this 80% of hive.tez.container.size

These figures are just an example; try to adjust them and check the EXPLAIN again. If it shows a Map-Join operator, then check the execution again: it should run much faster.

leftjoin

The idea is to create buckets so that we don't have to compare a lot of records. We are going to increase the number of records/joins so that multiple nodes can do the work, instead of one large cross join filtered by WHERE instr(t.tweet, p.person_name) > 0.

  1. I'd suggest splitting the tweets into individual words. Yes, this multiplies your record count way up.
  2. Filter out 'stopwords' or some other list of words that fits in memory.
  3. Split names into first name(s) and last name.
  4. Join tweets and names on the last name and instr(t.tweet, p.person_name) > 0. This should significantly reduce the amount of data that you compare via a function, and it will run faster. (A HiveQL sketch of these steps follows below.)
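
A minimal HiveQL sketch of steps 1-4, under stated assumptions: the stopwords table, the whitespace split regex, and using the last word of person_name as the join key are all illustrative choices on top of the question's tweets and people tables.

WITH tweet_words AS (
  -- 1. Explode each tweet into one row per word; lower() normalizes case.
  SELECT t.id, t.tweet, word
  FROM tweets t
  LATERAL VIEW explode(split(lower(t.tweet), '\\s+')) w AS word
),
filtered_words AS (
  -- 2. Drop stopwords via a left anti-join (assumes a one-column
  --    stopwords(word) table you have loaded yourself).
  SELECT tw.*
  FROM tweet_words tw
  LEFT JOIN stopwords s ON tw.word = s.word
  WHERE s.word IS NULL
),
name_keys AS (
  -- 3. Take the last word of person_name as the "last name" join key.
  SELECT p.id, p.person_name,
         lower(split(p.person_name, '\\s+')[size(split(p.person_name, '\\s+')) - 1]) AS last_name
  FROM people p
)
-- 4. Cheap equi-join on the key first, then confirm the full match;
--    DISTINCT collapses tweets that matched on several words.
SELECT DISTINCT f.id, f.tweet, n.person_name
FROM filtered_words f
JOIN name_keys n ON f.word = n.last_name
WHERE instr(lower(f.tweet), lower(n.person_name)) > 0;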

If you are going to do this regularly, consider creating the tables with sort/bucket to really make things sizzle (i.e. make it faster, as it can hopefully be Sort-Merge-Join ready).
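
For example, a hypothetical sketch of such tables; the names, types, bucket counts, and settings below are illustrative, not a tested recipe.

-- Bucket and sort both sides on the join key so Hive can use a
-- Sort-Merge-Bucket (SMB) join instead of a shuffle join.
CREATE TABLE tweet_words_bucketed (id BIGINT, tweet STRING, word STRING)
CLUSTERED BY (word) SORTED BY (word) INTO 64 BUCKETS;

CREATE TABLE name_keys_bucketed (id BIGINT, person_name STRING, last_name STRING)
CLUSTERED BY (last_name) SORTED BY (last_name) INTO 64 BUCKETS;

-- Settings commonly involved in enabling SMB joins:
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;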

Matt Andruff
  • Could you please provide more details on what the result of this step should be: "splitting the tweets into individual words"? For example, tweet 1 will become 6 rows, and stop words could be like "of, all, time" (and maybe "greatest" also), and filtering will result in two rows, Cristiano and Ronaldo? Right? – leftjoin Nov 18 '21 at 17:15
  • I have read that there are not so many surnames. For example, there are about 6,000 Chinese surnames in total. Hmm... it makes sense to split. – leftjoin Nov 18 '21 at 17:21
  • "For example tweet 1 will become 6" --> Yes. Stop words are well-recognized words that don't really add meaning; that would be a good start, but more could be added. (Likely a good source of words that wouldn't be used as a name.) "Filtering will result in two rows, Cristiano and Ronaldo?" --> No, only 1 row; really, you could use first names if you wanted. You just want to ensure this tweet 'could' have a match (so either name would work). European & American cultures commonly reuse first names at a higher rate than last names. – Matt Andruff Nov 18 '21 at 18:23
  • @MattAndruff Sounds like a good approach. But getting the celeb's last name itself is tricky here; there's no perfect way to extract it. So I'm just thinking: instead of splitting the tweet into individual words and joining based on the celeb's last name, is there any way in Hive to split the tweets into bi-grams and tri-grams (because full names are usually 2-3 words), so that I can directly JOIN with the list of celeb full names? For example, for a tweet "cristiano ronaldo age", the n-grams would be "cristiano ronaldo", "ronaldo age", "cristiano ronaldo age". (See the sketch after these comments.) – CM1 Dec 01 '21 at 08:36
  • You don't have to worry about using the last name. Use the first name. Randomly pick any part of the name; that's not the point. The point is that you are reducing the number of potential candidates (no longer is it a full table scan for potential matches). You are using a join to reduce the comparisons to potential targets that have at least some chance of matching. – Matt Andruff Dec 01 '21 at 14:49
  • Thanks, this approach worked for me. It just might need some additional processing based on individual's use case, but otherwise it was very efficient. – CM1 Dec 02 '21 at 23:15
  • If you found it worked for you and you feel comfortable doing so could you mark this answer as *the answer*? – Matt Andruff Dec 03 '21 at 15:10
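
A hypothetical HiveQL sketch of the n-gram idea from the comments above: bi-grams only (tri-grams extend the same pattern), and it assumes a Hive version that accepts arithmetic expressions in equi-join conditions.

WITH words AS (
  -- Number each word within its tweet so adjacent words can be paired.
  SELECT t.id, t.tweet, pos, word
  FROM tweets t
  LATERAL VIEW posexplode(split(lower(t.tweet), '\\s+')) w AS pos, word
),
bigrams AS (
  -- Pair each word with its right-hand neighbour to form "first last" candidates.
  SELECT w1.id, w1.tweet, concat(w1.word, ' ', w2.word) AS bigram
  FROM words w1
  JOIN words w2 ON w1.id = w2.id AND w2.pos = w1.pos + 1
)
-- A pure equi-join on the full name; no instr() scan is needed here.
SELECT DISTINCT b.id, b.tweet, p.person_name
FROM bigrams b
JOIN people p ON b.bigram = lower(p.person_name);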