2

Before posting, I tried the Hive sentences() function and searched around, but I couldn't get a clear understanding. My question: on what delimiter does the Hive sentences() function break each sentence? The Hive manual says "appropriate boundary"; what does that mean? Below are my attempts. I tried putting a period (.) and an exclamation mark (!) inside the sentence, and I'm getting different outputs. Can someone explain this?

with period (.)

select sentences('Tokenizes a string of natural language text into words and sentences. where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable

output - 1 array

[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences","where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]

with '!'

select sentences('Tokenizes a string of natural language text into words and sentences! where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable

output - 2 arrays

[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences"],["where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]
user7343922

2 Answers

1

Understanding what sentences() does should clear up your doubt.

Definition of sentences(str):

Splits str into arrays of sentences, where each sentence is an array of words.

Example:

SELECT sentences('Hello there! I am a UDF.') FROM src LIMIT 1;

[ ["Hello", "there"], ["I", "am", "a", "UDF"] ]



SELECT sentences('review . language') FROM movies;

[["review","language"]]

An exclamation point is a punctuation mark that goes at the end of a sentence. Periods and question marks are related punctuation marks that also end sentences. But as per the definition of sentences(), unnecessary punctuation, such as periods and commas in English, is automatically stripped. So we get two arrays of words with !. The splitting is locale-dependent and goes through java.util.Locale.
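To see the boundary rule outside Hive, here is a minimal Java sketch. It assumes that sentences() delegates to the JDK's locale-aware java.text.BreakIterator, which I have not confirmed against the Hive version in question, so treat it as an illustration rather than Hive's actual code. The class name SentenceBoundaryDemo and the method splitSentences are hypothetical names made up for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not part of Hive.
public class SentenceBoundaryDemo {

    // Splits text into sentences with the JDK's locale-aware sentence BreakIterator.
    // Assumption: Hive's sentences() relies on the same kind of boundary analysis.
    public static List<String> splitSentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "Hello there. how are you.",  // period, then a lowercase word
            "Hello there! how are you.",  // exclamation mark, then a lowercase word
            "Hello there. How are you."   // period, then a capitalized word
        };
        for (String in : inputs) {
            List<String> out = splitSentences(in, Locale.US);
            System.out.println(out.size() + " sentence(s): " + out);
        }
    }
}
```

If the assumption holds, this would explain the outputs in the question: a period followed by a lowercase word is treated as ambiguous (it could be an abbreviation) and is not a boundary, while ! always ends a sentence.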

0

I don't know the actual reason, but I observed that a period (.) does work as a sentence break if it is followed by a space and the next word starts with a capital letter. Here I changed "where" to "Where" and it worked. This is not required for !, however.

Tokenizes a string of natural language text into words and sentences. Where each sentence is broken at the appropriate sentence boundary and returned as an array of words.

This gives the output below:

[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences"],["Where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]
KAASSS