I'm new to Apache Pig and I have a problem to solve.
I'm trying to build a small search engine with Apache Pig. The idea is simple: I have a file that is the concatenation of multiple documents (one document per line, in the form id,text). Here is an example with three documents:
1,word1 word4 word2 word1
2,word2 word6 word1 word5 word3
3,word1 word3 word4 word5
Then I create a bag of words for each document, using these lines of code:
docs = LOAD '$documents' USING PigStorage(',') AS (id:int, line:chararray);
B = FOREACH docs GENERATE line;
C = FOREACH B GENERATE TOKENIZE(line) as gu;
Then I remove duplicate entries from each bag:
filtered = FOREACH C {
    uniq = DISTINCT gu;
    GENERATE uniq;
}
Here are the results of this code:
DUMP filtered;
({(word1), (word4), (word2)})
({(word2), (word6), (word1), (word5), (word3)})
({(word1), (word3), (word4), (word5)})
So I have one bag of unique words per document, as I wanted.
Now, let's consider the user query as a file:
word2 word7 word5
I transform the query into a bag of words:
query = LOAD '$query' AS (line_query:chararray);
bag_query = FOREACH query GENERATE TOKENIZE(line_query) AS quer;
DUMP bag_query;
Here are the results:
({(word2), (word7), (word5)})
Now, here is my problem: I would like to get the number of words that match between the query and each document. With this example, I would like to get this output (one count per document, in document order):
1
2
1
I tried to JOIN the bags, but it didn't work.
Could you help me, please?
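My guess is that the JOIN has to run on flattened (id, word) rows rather than on the bags themselves. Below is a rough, untested sketch of that idea, so please treat it as an assumption and not as working code. The relation names doc_pairs, uniq_doc_words, query_words, uniq_query_words, joined, grouped, counts and ordered are placeholders of mine. Unlike my code above, it keeps the document id, and it uses a LEFT OUTER join so that a document with no matching word would still appear with a count of 0.

```pig
-- Sketch only: same input files and parameters as in the code above.
docs = LOAD '$documents' USING PigStorage(',') AS (id:int, line:chararray);

-- One (id, word) row per token, with duplicates within a document removed.
doc_pairs = FOREACH docs GENERATE id, FLATTEN(TOKENIZE(line)) AS word;
uniq_doc_words = DISTINCT doc_pairs;

-- One row per distinct query word.
query = LOAD '$query' AS (line_query:chararray);
query_words = FOREACH query GENERATE FLATTEN(TOKENIZE(line_query)) AS qword;
uniq_query_words = DISTINCT query_words;

-- Keep every document word; qword is null when the word is not in the query.
joined = JOIN uniq_doc_words BY word LEFT OUTER, uniq_query_words BY qword;

-- COUNT skips tuples whose first field is null, so only matches are counted.
grouped = GROUP joined BY id;
counts = FOREACH grouped GENERATE group AS id, COUNT(joined.qword) AS matches;

ordered = ORDER counts BY id;
DUMP ordered;
```

If this is on the right track, I would expect it to give (1,1), (2,2) and (3,1) for the example above, but I have not been able to confirm that.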
Thank you.