1

I am playing around with the Luis stock ticker example here, GitHub MicrosoftBotBuilder Example, it works well and the entity in the utterances is identified but there are stock tickers in the world that have periods in them such as bt.a

Luis by default pre-processes utterances where word breaks are inserted around punctuation characters and therefore an utterance of "what is price of bt.a" becomes "what is price of bt. a" and therefore Luis thinks the entity is "bt" instead of "bt.a"

Does anyone know how to get around this? Thx

Daniel A. White
  • 187,200
  • 47
  • 362
  • 445
user2574678
  • 649
  • 12
  • 20

2 Answers2

3

This is how LUIS tokenizes utterances and I don't think it'll change int he near future. I think you can investigate one of the 2 solutions:

  1. Preprocess the utterance and normalize entities with punctuation (maybe save them in a map), and reverse the process when LUIS is called and the entities have been extracted.
  2. Use phrase list features and add the entities that LUIS misses in their Tokenized form, label the entity tokens in the utterance, and retrain the model (I suggest you try that in a clone of your app, so you don't lose any current progress)
Mokhtar Ashour
  • 600
  • 2
  • 9
  • 21
1

I need to process sentences with website addresses in them so I had to deal with a few different symbols. I found a technique that works for me, but it is not very elegant.

I am assuming here that you have an entity setup to represent the "stock symbol"

Here is what this would look like in your case.

  1. Detect the cases when LUIS gets the "stock symbol" entity wrong. In your case this may be whenever it ends in a period.
  2. When LUIS gets the entity wrong, tokenize the raw query using spaces as the separator. Grab the proper token by looking for a match with the wrong partial token.

So for your example....

"what is price of bt.a"

You would see the "stock symbol" entity of "bt." and know that it is wrong because it ends in a period. You would then tokenize the query and look for tokens that contain "bt.". This would identify "bt.a" as the requested symbol.

Its not pretty, but in the case of website addresses has been reliable.

Bryan Hall
  • 23
  • 5
  • Thanks for your feedback, I ended up adding a few entities into the phrase list and Luis eventually picked up on my lead and learnt from it. – user2574678 Aug 05 '16 at 04:58