Hive for bag of words (word count for each word in the dictionary)

Question

I have a table with this structure:

user_id | message_id | content
   1    |      1     | "I like cats"
   1    |      1     | "I like dogs"

I also have a list of valid words in dictionary.txt (or an external Hive table), for example:

I,like,dogs,cats,lemurs

My goal is to generate a word-count table for each user:

user_id  |  "I"  |  "like"  |  "dogs"  |  "cats"  |  "lemurs"
   1     |   2   |     2    |     1    |     1    |     0

This is what I tried so far:

SELECT user_id, word, COUNT(*) 
FROM messages LATERAL VIEW explode(split(content, ' ')) lTable as word 
GROUP BY user_id,word;

Why the downvotes? An explanation would be far more helpful – Uri Goren, Mar 06 '16 at 13:24
How do I incorporate the predefined dictionary file and generate a row of constant length (not depending on the number of unique words)? – Uri Goren, Mar 06 '16 at 15:44

score 1 · Answer 1 · answered Mar 08 '16 at 21:36

I am not very familiar with pivoting in Hive, but it is possible in Pig.

DEFINE GET_WORDCOUNTS com.stackoverflow.pig.GetWordCounts('$dictionary_path');

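-- Load the messages; the source path is elided as in the original.
A = LOAD .... AS (user_id, message_id, content);

-- Collect all messages of each user into one bag.
C = GROUP A BY user_id;

-- Pass each user's message contents to the UDF and flatten the returned counts.
D = FOREACH C GENERATE group, FLATTEN(GET_WORDCOUNTS(A.content));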

You will have to write a simple UDF, GetWordCounts, that tokenizes the content of each grouped record and counts the tokens that appear in the input dictionary.
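
For comparison, the same tokenize-and-check idea can be sketched directly in HiveQL with conditional aggregation, which gives one fixed-width row per user no matter which words actually occur. This is only a sketch under several assumptions: it uses the `messages` table from the question, it hard-codes the five dictionary words as columns (the `cnt_*` column names are made up here, and the column list would have to be generated from dictionary.txt by a script, since as far as I know Hive has no dynamic pivot), and it splits on single spaces without stripping punctuation.

```sql
-- Sketch: one column per dictionary word; words outside the list are ignored.
-- Column names (cnt_i, cnt_like, ...) are illustrative, not from the question.
SELECT user_id,
       SUM(CASE WHEN word = 'I'      THEN 1 ELSE 0 END) AS cnt_i,
       SUM(CASE WHEN word = 'like'   THEN 1 ELSE 0 END) AS cnt_like,
       SUM(CASE WHEN word = 'dogs'   THEN 1 ELSE 0 END) AS cnt_dogs,
       SUM(CASE WHEN word = 'cats'   THEN 1 ELSE 0 END) AS cnt_cats,
       SUM(CASE WHEN word = 'lemurs' THEN 1 ELSE 0 END) AS cnt_lemurs
FROM messages
LATERAL VIEW explode(split(content, ' ')) lTable AS word
GROUP BY user_id;
```

Because every dictionary word has its own SUM, a word that never occurs (such as lemurs) still appears with a count of 0, which a plain GROUP BY user_id, word cannot provide.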

score 1 · Accepted Answer · edited Mar 15 '16 at 08:34

Check this:

select ename, 
length(ename)-length(replace(ename,'A', '')) A,
length(ename)-length(replace(ename,'W', '')) W 
FROM EMP;

Alternatively, you can define a variable holding your search string and use it in place of 'A', 'W', etc.

answered Mar 15 '16 at 07:12 by siva krishna · edited Mar 15 '16 at 08:34 by Uri Goren

This trick counts the number of characters that were replaced, not the number of replacements – Uri Goren, Mar 15 '16 at 08:35
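
One way to address this is to divide the character difference by the length of the search string, which turns it into a number of occurrences. The sketch below combines that with the variable suggestion from the answer. It assumes the `messages` table from the question, a Hive version that provides `replace()` (older versions would need `regexp_replace`, which treats the search string as a regular expression), and a hypothetical variable named `word`. It counts substring matches, so 'like' would also be counted inside 'likes'.

```sql
-- Hypothetical variable holding the search string.
SET hivevar:word=like;

-- Characters removed divided by the word length = number of occurrences.
SELECT user_id,
       CAST(SUM((length(content)
                 - length(replace(content, '${hivevar:word}', '')))
                / length('${hivevar:word}')) AS INT) AS cnt_word
FROM messages
GROUP BY user_id;
```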
