This is possible in Hive. Split the string on non-alphabetic characters, flatten the resulting array with lateral view + explode, then count the words:
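    -- sample data: two input strings
    with your_data as (
        select stack(2,
                     'Hey, how are you?',
                     'Hey, Who is there?'
               ) as initial_string
    )
    select w.word, count(*) cnt
    from
    (
        -- lowercase, then split on runs of non-alphabetic characters
        select split(lower(initial_string), '[^a-zA-Z]+') words from your_data
    ) s lateral view explode(words) w as word
    where w.word != ''  -- drop empty tokens left by trailing punctuation
    group by w.word;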
Result:
word cnt
are 1
hey 2
how 1
is 1
there 1
who 1
you 1
Another method uses the sentences() function, which returns an array of tokenized sentences (an array of arrays of words):
    -- sample data: two input strings
    with your_data as (
        select stack(2,
                     'Hey, how are you?',
                     'Hey, Who is there?'
               ) as initial_string
    )
    select w.word, count(*) cnt
    from
    (
        select sentences(lower(initial_string)) sentences from your_data
    ) d lateral view explode(sentences) s as sentence  -- one row per sentence
        lateral view explode(s.sentence) w as word     -- one row per word
    group by w.word;
Result:
word cnt
are 1
hey 2
how 1
is 1
there 1
who 1
you 1
The sentences(string str, string lang, string locale) function tokenizes a string of natural-language text into words and sentences: each sentence is broken at the appropriate sentence boundary and returned as an array of words. The lang and locale arguments are optional. For example, sentences('Hello there! How are you?') returns [["Hello", "there"], ["How", "are", "you"]].
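To see what the two explode steps operate on, it can help to look at the nested array before flattening it. The query below is only a sketch that reuses the sample data and names from the queries above (your_data, initial_string); the tokenized alias is made up for illustration.

```sql
-- Sketch: inspect the array<array<string>> that sentences() produces,
-- one row per input string, before any lateral view / explode is applied.
with your_data as (
    select stack(2,
                 'Hey, how are you?',
                 'Hey, Who is there?'
           ) as initial_string
)
select initial_string,
       sentences(lower(initial_string)) as tokenized  -- hypothetical alias
from your_data;
```

Each row should hold an outer array of sentences whose elements are arrays of words, which is why the word count query needs one explode per nesting level.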