How to deal with the Rasa NLU data imbalance problem?

Question

I have 12 intents to identify, but the amount of training data per intent is very uneven. Intents such as setting up a meeting or setting a reminder have thousands of examples each, while intents such as greeting or saying thank you have very few, perhaps only a few dozen.

How do I deal with this data imbalance problem?

My config.yml file content is as follows:

language: en

pipeline:
  - name: "WhitespaceTokenizer"
  - name: "RegexFeaturizer"
  - name: "CountVectorsFeaturizer"
    analyzer: char_wb
    min_ngram: 2
    max_ngram: 5
    stop_words: "english"
  - name: "CRFEntityExtractor"
  - name: "extractor.regex.RegexEntityExtractor"
  - name: "EmbeddingIntentClassifier"
    epochs: 100
    num_neg: 2
  - name: "DucklingHTTPExtractor"
    url: "http://localhost:8000"
    dimensions: ["time", "duration", "phone-number", "distance"]

policies:
  - name: MemoizationPolicy
  - name: EmbeddingPolicy
    epochs: 20
  - name: FormPolicy
  - name: MappingPolicy
  - name: FallbackPolicy
    fallback_action_name: "action_default_fallback"

score 1 · Answer 1 · answered Oct 22 '19 at 11:05

I'm not sure I have understood your question correctly. As far as I understand it, you don't have to worry that intents like greet or deny have only a few examples while others have thousands.

The problem occurs when you have multiple intents that differ from each other only in very small ways. In that situation, if you do not give Rasa correct and clearly separated data, it can confuse those intents and return the wrong one. What you should worry about is making the examples for each intent distinct from the others, so that Rasa is less likely to confuse them and you get the right output.

score 0 · Answer 2 · answered Jan 16 '20 at 11:35


Here is the relevant passage from the Rasa documentation; I hope it gives you what you need.

Classification algorithms often do not perform well if there is a large class imbalance, for example if you have a lot of training data for some intents and very little training data for others. To mitigate this problem, rasa’s supervised_embeddings pipeline uses a balanced batching strategy.
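To make the idea less mysterious, here is a minimal Python sketch of class-balanced batching. It is not Rasa's actual implementation: it simply picks an intent uniformly at random for every slot in a batch and then picks an example of that intent, so rare intents get re-used instead of being drowned out. As far as I recall, Rasa's own strategy still lets frequent intents appear more often than rare ones, so treat this as a simplified illustration. The function name `balanced_batches` and the toy intents `set_meeting` and `greet` are made up for the example.

```python
import random
from collections import Counter, defaultdict


def balanced_batches(examples, batch_size, num_batches, seed=0):
    """Yield batches in which every intent is equally likely per slot.

    `examples` is a list of (text, intent) pairs. Examples of rare
    intents are sampled with replacement, so they are repeated often.
    """
    rng = random.Random(seed)

    # Group the example texts by their intent label.
    by_intent = defaultdict(list)
    for text, intent in examples:
        by_intent[intent].append(text)
    intents = sorted(by_intent)

    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            intent = rng.choice(intents)          # pick a class uniformly
            text = rng.choice(by_intent[intent])  # then one of its examples
            batch.append((text, intent))
        yield batch


# Toy data: 1000 "set_meeting" examples versus 20 "greet" examples.
data = [(f"meeting {i}", "set_meeting") for i in range(1000)]
data += [(f"hello {i}", "greet") for i in range(20)]

# Count how often each intent shows up across 50 batches of 64.
counts = Counter()
for batch in balanced_batches(data, batch_size=64, num_batches=50):
    counts.update(intent for _, intent in batch)
print(counts)
```

With plain sequential batching, `greet` would make up about 2% of each batch here; with the sketch above it makes up roughly half. If I remember the Rasa 1.x docs correctly, balanced batching is on by default for the embedding classifier and `batch_strategy: sequence` switches it off, so you normally do not need to change anything in your config.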


Bidya


Yes, this does answer the question: use a setting Rasa provides. But an even better answer would explain the balanced batching strategy in a non-mysterious way. – demongolem Dec 28 '20 at 13:06
