0

I have a table with this structure:

user_id | message_id | content
   1    |      1     | "I like cats"
   1    |      1     | "I like dogs"

And a list of valid words in dictionary.txt (or an external hive table), for example:

I,like,dogs,cats,lemurs

And my goal is to generate an word-count table for each user

user_id  |  "I"  |  "like"  |  "dogs"  |  "cats"  |  "lemurs"
   1     |   2   |     2    |     1    |     1    |     0

This is what I tried so far:

SELECT user_id, word, COUNT(*) 
FROM messages LATERAL VIEW explode(split(content, ' ')) lTable as word 
GROUP BY user_id,word;
CAFEBABE
  • 3,983
  • 1
  • 19
  • 38
Uri Goren
  • 13,386
  • 6
  • 58
  • 110

2 Answers2

1

I am not very familiar with doing Pivot on Hive, but in pig it can be possible to do.

DEFINE GET_WORDCOUNTS com.stackoverflow.pig.GetWordCounts('$dictionary_path');

A = LOAD .... AS user_id, message_id, content; 

C = GROUP B BY (user_id);

D = FOREACH C GENERATE group, FLATTEN(GET_WORDCOUNTS(B.content));

You will have to write a simple UDF GetWordCounts which tokenizes your input content for each grouped record, and checks with input dictionary.

rahulbmv
  • 704
  • 3
  • 16
1

Check this :

select ename, 
length(ename)-length(replace(ename,'A', '')) A,
length(ename)-length(replace(ename,'W', '')) W 
FROM EMP;

Else you can define a variable(your search string) and place it in the place of 'A', 'W' etc

Uri Goren
  • 13,386
  • 6
  • 58
  • 110
siva krishna
  • 122
  • 8