Apache Pig - How to get number of matching elements between multiple bags?

Question

I'm new to Apache Pig and I have a problem to solve.

I'm trying to build a small search engine with Apache Pig. The idea is simple: I have a file that is the concatenation of multiple documents (one document per line). Here is an example with three documents:

1,word1 word4 word2 word1
2,word2 word6 word1 word5 word3
3,word1 word3 word4 word5

Then I create a bag of words for each document, using these lines of code:

docs = LOAD '$documents' USING PigStorage(',') AS (id:int, line:chararray);
B = FOREACH docs GENERATE line;
C = FOREACH B GENERATE TOKENIZE(line) as gu;

Then I remove duplicate entries from the bags:

filtered = FOREACH C {
    uniq = DISTINCT gu;
    GENERATE uniq;
}

Here are the results of this code:

DUMP filtered;

({(word1), (word4), (word2)})
({(word2), (word6), (word1), (word5), (word3)})
({(word1), (word3), (word4), (word5)})

So I have a bag of words per document, like I wanted.

Now, let's consider the user query as a file:

word2 word7 word5

I transform the query to a bag of words:

query = LOAD '$query' AS (line_query:chararray);
bag_query = FOREACH query GENERATE TOKENIZE(line_query) AS quer;

DUMP bag_query;

Here are the results:

({(word2), (word7), (word5)})

Now, here is my problem: I would like to get the number of matches between the query and each document. With this example, I would like to have this output:

1
2
1

I tried a JOIN between the bags, but it didn't work.

Could you help me, please?

Thank you.

score 1 · Answer 1 · answered May 22 '13 at 09:13

Try using SetIntersect (a DataFu UDF - https://github.com/linkedin/datafu) and SIZE to get the number of elements in the result bag.

SNeumann

Thanks for your response, but it doesn't work. My bags are in separate variables, and it seems that SetIntersect requires the bags to be in the same variable. – shanks_roux May 22 '13 at 13:24

score 1 · Accepted Answer · answered May 22 '13 at 17:14

If you are OK with not using any UDFs, then it can be done by pivoting the bags and going all SQL style.

docs = LOAD '/input/search.dat' USING PigStorage(',') AS (id:int, line:chararray);
C = FOREACH docs GENERATE id, TOKENIZE(line) as gu;
pivoted = FOREACH C {
    uniq = DISTINCT gu;
    GENERATE id, FLATTEN(uniq) as word;
};
filtered = FILTER pivoted BY word MATCHES '(word2|word7|word5)';
--dump filtered;
count_id_matched = FOREACH (GROUP filtered BY id) GENERATE group as id, COUNT(filtered) as count;

dump count_id_matched;

count_word_matched_in_docs = FOREACH (GROUP filtered BY word) GENERATE group as word, COUNT(filtered) as count;

dump count_word_matched_in_docs;
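
The regular expression above hard-codes the query words. A minimal sketch of the same pivot idea that reads the query from a file and uses an inner JOIN instead might look like the following. The path '/input/query.dat' and the alias names (doc_words, query_words, matched, match_counts) are assumptions made for illustration and are not part of the original answer; it also assumes the query file holds a single whitespace-separated query.

```pig
-- Sketch only: the query path and all alias names below are hypothetical.
docs = LOAD '/input/search.dat' USING PigStorage(',') AS (id:int, line:chararray);

-- One (id, word) row per token, then drop repeated words within a document.
doc_tokens = FOREACH docs GENERATE id, FLATTEN(TOKENIZE(line)) AS word;
doc_words = DISTINCT doc_tokens;

-- Pivot the query the same way: one row per distinct query word.
query = LOAD '/input/query.dat' AS (line_query:chararray);
query_tokens = FOREACH query GENERATE FLATTEN(TOKENIZE(line_query)) AS word;
query_words = DISTINCT query_tokens;

-- The inner join keeps only the (id, word) rows whose word occurs in the query.
matched = JOIN doc_words BY word, query_words BY word;

-- Count the surviving rows per document id.
match_counts = FOREACH (GROUP matched BY doc_words::id)
               GENERATE group AS id, COUNT(matched) AS cnt;

DUMP match_counts;
```

With the three sample documents and the query from the question, this should give counts of 1, 2 and 1 for documents 1, 2 and 3. As with the FILTER version, a document that matches no query word is dropped by the inner join rather than reported with a count of 0.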

score 0 · Answer 3 · answered Dec 03 '13 at 18:03

As SNeumann pointed out, you can use DataFu's SetIntersect for your example.

Building off your example, given these documents:

1,word1 word4 word2 word1
2,word2 word6 word1 word5 word3 word7
3,word1 word3 word4 word5

And given this query:

word2 word7 word5

Then this code gives you what you want:

define SetIntersect datafu.pig.sets.SetIntersect();

docs = LOAD 'docs' USING PigStorage(',') AS (id:int, line:chararray);
B = FOREACH docs GENERATE id, line;
C = FOREACH B GENERATE id, TOKENIZE(line) as gu;

filtered = FOREACH C {
  uniq = DISTINCT gu;
  GENERATE id, uniq;
}

query = LOAD 'query' AS (line_query:chararray);
bag_query = FOREACH query GENERATE TOKENIZE(line_query) AS query;
-- sort the bag of tokens, since SetIntersect requires it
bag_query = FOREACH bag_query {
  query_sorted = ORDER query BY token;
  GENERATE query_sorted;
}

result = FOREACH filtered {
  -- sort the tokens, since SetIntersect requires it
  tokens_sorted = ORDER uniq BY token;
  GENERATE id, 
           SIZE(SetIntersect(tokens_sorted,bag_query.query_sorted)) as cnt;
}

DUMP result;

Values for result:

(1,1)
(2,3)
(3,1)

Here is a fully working example that you can paste into the DataFu unit tests for SetIntersect:

/**
register $JAR_PATH

define SetIntersect datafu.pig.sets.SetIntersect();

docs = LOAD 'docs' USING PigStorage(',') AS (id:int, line:chararray);
B = FOREACH docs GENERATE id, line;
C = FOREACH B GENERATE id, TOKENIZE(line) as gu;

filtered = FOREACH C {
  uniq = DISTINCT gu;
  GENERATE id, uniq;
}

query = LOAD 'query' AS (line_query:chararray);
bag_query = FOREACH query GENERATE TOKENIZE(line_query) AS query;
-- sort the bag of tokens, since SetIntersect requires it
bag_query = FOREACH bag_query {
  query_sorted = ORDER query BY token;
  GENERATE query_sorted;
}

result = FOREACH filtered {
  -- sort the tokens, since SetIntersect requires it
  tokens_sorted = ORDER uniq BY token;
  GENERATE id, 
           SIZE(SetIntersect(tokens_sorted,bag_query.query_sorted)) as cnt;
}

DUMP result;

 */
@Multiline
private String setIntersectTestExample;

@Test
public void setIntersectTestExample() throws Exception
{    
  PigTest test = createPigTestFromString(setIntersectTestExample);    

  writeLinesToFile("docs", 
                   "1,word1 word4 word2 word1",
                   "2,word2 word6 word1 word5 word3 word7",
                   "3,word1 word3 word4 word5");

  writeLinesToFile("query", 
                   "word2 word7 word5");

  test.runScript();

  super.getLinesForAlias(test, "filtered");
  super.getLinesForAlias(test, "query");
  super.getLinesForAlias(test, "result");
}

If you have any other similar use cases I'd love to hear them :) We are always looking to contribute more useful UDFs to DataFu.
