
I'm using the Cloud Natural Language API to analyze text from Google Speech, and it seems to have trouble tokenizing contractions. For example,

"I don't like you"

comes back as tokens whose content_text values are:

"I" "do" "n't" "like" "you"

Escaping the quotes did not help; in that case it came back as

"I" "don" "\'t" "like" "you"

However, I found that removing the apostrophes did help: the text

I dont like you

came back with "dont" tagged as a verb (correct enough).

Is this the correct workaround for now?

user9548
  • Why do you consider the response `"I" "do" "n't" "like" "you"` incorrect? The token `"n't"` seems correctly identified as the contracted form of "not". For example you can put the text in the Try The API text box on [this page](https://cloud.google.com/natural-language/) and click the Syntax tab to see the parsed syntax. – dizcology Jul 05 '17 at 21:22
  • https://www.merriam-webster.com/dictionary/don't – user9548 Jul 06 '17 at 01:42
  • I see - depending on your actual application it might or might not be what you want. For instance you probably want "they're" to be split up as two tokens: https://www.merriam-webster.com/dictionary/they're – dizcology Jul 06 '17 at 23:06
  • Not in some cases, for example if I'm passing the token text to a text-to-speech service or otherwise rendering it to be user-visible. – user9548 Jul 07 '17 at 10:05
  • Out of curiosity, why do you want to send the tokens to the text-to-speech service instead of the original text? – dizcology Jul 08 '17 at 00:48
  • TBH it's an artifact of my early design and I've been thinking the same thing. I'm going to pass the original text alongside the language results and render that. – user9548 Jul 08 '17 at 13:00
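
If token-level text is still needed for display, one alternative to stripping apostrophes is to glue each split-off clitic back onto the token before it. The following is only a minimal sketch: it assumes the token texts are already available as plain strings (no API call is made), and both `merge_contractions` and the `CLITICS` list are hypothetical names chosen for illustration, not something the API documents.

```python
# Clitic suffixes that the tokenizer appears to split off.
# This list is an assumption for illustration and is not exhaustive.
CLITICS = {"n't", "'s", "'re", "'ve", "'ll", "'d", "'m"}


def merge_contractions(tokens):
    """Re-attach clitic tokens such as "n't" to the preceding token."""
    merged = []
    for tok in tokens:
        if merged and tok.lower() in CLITICS:
            # Append the clitic to the previous token: "do" + "n't" -> "don't"
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged


print(merge_contractions(["I", "do", "n't", "like", "you"]))
# ['I', "don't", 'like', 'you']
```

This keeps the original input intact for analysis, so the part-of-speech tags still come from the unmodified text. Note that it also re-attaches the possessive `'s`, which is probably what you want for rendering but may not suit other uses.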

0 Answers