I have a working installation of rasa_nlu, running Python 3.6.5 on macOS High Sierra. I was able to get the sample tutorial working. I'm running into trouble getting it to work with synonyms.

Here's the relevant portion of my training file, first-model.md.

## intent:select
- what is the [max](operator) rating?

## synonym:max
- best
- highest
- maximum

Now, rasa_nlu correctly detects the intent and entity for a question such as `what is the max rating?`:

{'intent': {'name': 'select', 'confidence': 0.9542820453643799},
 'entities': [{'start': 12,
   'end': 15,
   'value': 'max',
   'entity': 'operator',
   'confidence': 0.8146240434922525,
   'extractor': 'ner_crf'}],
 'intent_ranking': [{'name': 'select', 'confidence': 0.9542820453643799},
  {'name': 'identity', 'confidence': 0.036332450807094574}],
 'text': 'what is the max rating?'}

However, when I use a synonym in the question, it doesn't detect the entity. For example, `what is the best rating?`:

{'intent': {'name': 'select', 'confidence': 0.9382177591323853},
 'entities': [],
 'intent_ranking': [{'name': 'select', 'confidence': 0.9382177591323853},
  {'name': 'identity', 'confidence': 0.10226328670978546}],
 'text': 'what is the best rating?'}

No dice with the synonym. I've tried this with both the spacy_sklearn and tensorflow_embedding pipelines, and I see similar results.

Would greatly appreciate any pointers.

Cheers.

Update: Per @Caleb's suggestion below, I updated the training to:

## intent:select
- what is the [max](operator) rating?
- what is the [highest](operator:max) rating?
- what is the [maximum](operator:max) rating?
- what is the [best](operator:max) rating?

While this improves the situation, it doesn't fully solve the problem. The system now returns each synonym (e.g. highest, maximum, best) as the entity value instead of the canonical value (max). For example, if I ask `what is the best rating?`, I expect `max` as the entity value, but the system returns `best`.

{'intent': {'name': 'select', 'confidence': 0.9736428260803223},
 'entities': [{'start': 12,
   'end': 16,
   'value': 'best',
   'entity': 'operator',
   'confidence': 0.9105035376516767,
   'extractor': 'ner_crf'}],
 'intent_ranking': [{'name': 'select', 'confidence': 0.9736428260803223},
  {'name': 'identity', 'confidence': 0.0}],
 'text': 'what is the best rating?'}
Deven

2 Answers

I stumbled across a combination that works for my use case.

  1. Use json format instead of markdown for the training data (see below for example)
  2. Use spacy_sklearn pipeline instead of tensorflow_embedding (see below for example)

I'm sure there's a good explanation for why that combination works and others don't, but I don't have a handle on it yet. Alternatively, perhaps the other combinations need additional configuration to work.

Cheers.

Here's the JSON version of the training data.

{
    "rasa_nlu_data": {
        "common_examples": [
              {
                "text": "what is the best rating?",
                "intent": "select",
                "entities": [
                  {
                    "start": 12,
                    "end": 16,
                    "value": "max",
                    "entity": "operator"
                  }
                ]
              },
              {
                "text": "what is the max rating?",
                "intent": "select",
                "entities": [
                  {
                    "start": 12,
                    "end": 15,
                    "value": "max",
                    "entity": "operator"
                  }
                ]
              },
              {
                "text": "what is the highest rating?",
                "intent": "select",
                "entities": [
                  {
                    "start": 12,
                    "end": 19,
                    "value": "max",
                    "entity": "operator"
                  }
                ]
              }
        ],
        "regex_features" : [],
        "entity_synonyms": [
            {
                "entity": "operator",
                "value": "max",
                "synonyms": ["maximum", "most", "highest", "biggest", "best"]
            }
        ]
    }
}
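
One detail worth checking in this format: `start` and `end` are character offsets into `text` (with `end` exclusive), while `value` holds the normalized value rather than the surface word. A minimal sketch of a sanity check on the three examples above follows; the helper name `annotated_span` is hypothetical and not part of rasa_nlu.

```python
# Examples copied from the JSON training data above (intent omitted for brevity).
examples = [
    {"text": "what is the best rating?", "start": 12, "end": 16, "value": "max"},
    {"text": "what is the max rating?", "start": 12, "end": 15, "value": "max"},
    {"text": "what is the highest rating?", "start": 12, "end": 19, "value": "max"},
]


def annotated_span(example):
    """Return the surface text that the start/end offsets point at."""
    return example["text"][example["start"]:example["end"]]


spans = [annotated_span(e) for e in examples]
print(spans)  # ['best', 'max', 'highest']
```

Each span is the word as typed in the sentence, while `value` is `max` in all three cases; that difference is what tells the trainer to map the surface word to the canonical value.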

And here's the pipeline I used (thanks @Caleb for the suggestion to include it as well).

language: "en_core_web_md"
pipeline: "spacy_sklearn"
Deven
  • I do think JSON makes it easier. As far as pipeline, my guess is you just needed to add `ner_synonyms` to your pipeline for it to work in the tensorflow one as well. – Caleb Keller Aug 20 '18 at 23:52

Please see the note located on this page of the docs.

Please note that adding synonyms using the above format does not improve the model’s classification of those entities. Entities must be properly classified before they can be replaced with the synonym value.

This means that you need to include some of these other words in your training data so that the entity classifier learns to correctly classify those words as that entity. Once the word is correctly classified, then synonyms can kick in and normalize it.
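
The two stages described here can be sketched in plain Python. This is only an illustration of the idea, not rasa_nlu's actual implementation; the `SYNONYMS` table and the `normalize_entities` function are hypothetical names.

```python
# Hypothetical synonym table: surface form -> canonical value.
SYNONYMS = {"best": "max", "highest": "max", "maximum": "max"}


def normalize_entities(entities, synonyms):
    """Return copies of already-extracted entities with synonym values replaced.

    Only entities that the extractor found can be normalized; an empty
    extraction result stays empty no matter what the synonym table holds.
    """
    normalized = []
    for entity in entities:
        updated = dict(entity)  # copy so the input is not mutated
        key = str(entity["value"]).lower()
        updated["value"] = synonyms.get(key, entity["value"])
        normalized.append(updated)
    return normalized


# Case 1: the extractor recognized "best" as an operator, so it can be mapped.
extracted = [{"start": 12, "end": 16, "value": "best", "entity": "operator"}]
result = normalize_entities(extracted, SYNONYMS)
print(result[0]["value"])  # max

# Case 2: the extractor found nothing, so there is nothing to replace.
print(normalize_entities([], SYNONYMS))  # []
```

The second case corresponds to the original symptom in the question: a synonym list alone cannot help when the entity extractor never tags the word in the first place.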

It's also possible to use tools like chatito to generate training data from a single intent example and a list of entities/synonyms. Be careful, though: generating too many examples from a single sentence structure with templates like this can cause overfitting.
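
As a rough illustration of template-based generation, here is a minimal Python sketch that expands one sentence template into rasa_nlu markdown examples using the `[word](entity:value)` annotation. It does not use chatito or its syntax; the function name `expand_template` and its parameters are hypothetical.

```python
def expand_template(template, entity, canonical, surface_forms):
    """Expand one sentence template into markdown training lines.

    The canonical value is annotated as [word](entity); every other
    surface form is annotated as [word](entity:canonical).
    """
    lines = []
    for form in surface_forms:
        if form == canonical:
            annotation = "[{}]({})".format(form, entity)
        else:
            annotation = "[{}]({}:{})".format(form, entity, canonical)
        lines.append("- " + template.format(annotation))
    return lines


lines = expand_template(
    "what is the {} rating?",
    entity="operator",
    canonical="max",
    surface_forms=["max", "highest", "maximum", "best"],
)
print("\n".join(lines))
# - what is the [max](operator) rating?
# - what is the [highest](operator:max) rating?
# - what is the [maximum](operator:max) rating?
# - what is the [best](operator:max) rating?
```

To limit the overfitting risk mentioned above, it probably helps to vary the templates themselves rather than emit many lines from one sentence structure.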

Caleb Keller
  • Thanks, @caleb-keller. Please see my updated note above. The system now recognizes the entity, but doesn't seem to map the synonyms back to the unique value. Should I be setting up the examples differently? I understand the point re: overfitting, but I'm not yet sure how to train the system -- for example, I'll have multiple values for that `operator` entity, each with multiple synonyms. I definitely need a good training scheme so the system recognizes the values for the entity and its synonyms (without overfitting). – Deven Aug 20 '18 at 17:14
  • Can you add your pipeline to your answer? – Caleb Keller Aug 20 '18 at 23:48
  • Done -- I added the pipeline to my answer. – Deven Aug 21 '18 at 16:10