2

I am building a bot with Rasa.ai. When training the bot with Rasa NLU, we use a training data file in which the text, intent, entities, etc. are specified. For example, for a simple restaurant chatbot, the training file data.json may contain:

{
  "text": "central indian restaurant",
  "intent": "restaurant_search",
  "entities": [
    {
      "start": 0,
      "end": 7,
      "value": "central",
      "entity": "location"
    },
    {
      "start": 8,
      "end": 14,
      "value": "indian",
      "entity": "cuisine"
    }
  ]
}

We use this to train the model, but the training file has to be created manually (or through a GUI).

Is there any tool that I can feed sentences to and that will automatically create the intents and entities?

Sample Input: Is there any central Indian restaurant?
Sample Output: The above data.json

EDIT:

To better explain this question: suppose I have a huge set of customer service call logs. My understanding is that with Rasa (or another similar framework), a human being needs to go through the call logs, work out every intent and entity combination that occurred in the past, and create a file like the data.json above before training the model. This seems like a really unscalable problem. Is there a way to generate that data.json file from gigabytes of call logs without involving a human being? Am I missing something here?

nad
  • I also responded to your question in the Rasa Gitter, but if I am understanding the request correctly, then what you are asking for doesn't make sense. If there were a tool that automatically labelled the intents and entities and did it correctly, then you wouldn't need Rasa NLU; you could just pass user texts directly into that tool. What exactly is it that you are after? – Caleb Keller May 14 '18 at 02:23
  • @CalebKeller please see the updated EDIT section for clarity. – nad May 14 '18 at 02:49

4 Answers

3

This is exactly the task that you are training Rasa NLU to perform: take in sentences and turn them into structured output. By providing examples, you are teaching the model how this works.

So you don't have to annotate gigabytes of customer logs, only some of them; the algorithm should then generalise to sentences it hasn't seen yet. How well this works depends on how many intents you have, how complex they are, and other factors.

I would start by annotating a few hundred sentences (the markdown format is a bit easier actually), keep 50 or so examples separate, and see how well Rasa NLU predicts them. Keep annotating more and more examples and add them to your training data, until you are happy with the performance on the held-out examples.
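The hold-out loop described above can be sketched in a few lines of Python. This is only an illustration: `split_examples`, `intent_accuracy` and `dummy_predict` are hypothetical helpers, not part of Rasa, and `dummy_predict` stands in for whatever call your trained model exposes for predicting an intent.

```python
import random


def split_examples(examples, held_out=50, seed=0):
    """Shuffle annotated examples and set aside `held_out` of them for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[held_out:], shuffled[:held_out]


def intent_accuracy(predict_intent, test_examples):
    """Fraction of held-out examples whose intent is predicted correctly."""
    if not test_examples:
        return 0.0
    correct = sum(
        1 for ex in test_examples if predict_intent(ex["text"]) == ex["intent"]
    )
    return correct / len(test_examples)


# Toy data standing in for a few hundred hand-annotated sentences.
examples = [
    {"text": f"show me restaurant {i}", "intent": "restaurant_search"}
    for i in range(200)
]
train, test = split_examples(examples, held_out=50)


# Hypothetical stand-in for the trained model's intent prediction.
def dummy_predict(text):
    return "restaurant_search"


accuracy = intent_accuracy(dummy_predict, test)
print(len(train), len(test), accuracy)
```

Each time you annotate more sentences, add them to the training portion only, retrain, and re-run the accuracy check on the same held-out set so the numbers stay comparable.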

amn41
3

A fast way to generate arbitrarily large training datasets with a few lines of code is Chatito:

  1. You write down typical sentences and synonyms for the entities in an intuitive DSL.
  2. It generates all the combinations for you and shuffles them for better training.
  3. It splits the examples between two files, one for training and one for testing, so you can measure the accuracy of your trained language model.
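To make the three steps concrete, here is a rough Python sketch of the same idea. It does not use Chatito or its DSL; the `{slot}` template syntax, the `expand` helper and the 80/20 split are assumptions made purely for illustration. It produces examples in the data.json shape shown in the question, including the `start`/`end` character offsets.

```python
import itertools
import random


def expand(template, slots, intent):
    """Yield one Rasa-style example per combination of slot values."""
    names = list(slots)
    for combo in itertools.product(*(slots[n] for n in names)):
        values = dict(zip(names, combo))
        text, entities = "", []
        # Build the sentence word by word so character offsets can be recorded.
        for token in template.split():
            if text:
                text += " "
            if token.startswith("{") and token.endswith("}"):
                name = token[1:-1]
                value = values[name]
                entities.append({
                    "start": len(text),
                    "end": len(text) + len(value),
                    "value": value,
                    "entity": name,
                })
                text += value
            else:
                text += token
        yield {"text": text, "intent": intent, "entities": entities}


# Step 1: a typical sentence plus alternative values for each entity.
slots = {
    "location": ["central", "north"],
    "cuisine": ["indian", "chinese", "italian"],
}
examples = list(
    expand("{location} {cuisine} restaurant", slots, "restaurant_search")
)

# Step 2: shuffle a copy of the generated combinations.
shuffled = list(examples)
random.Random(0).shuffle(shuffled)

# Step 3: split into a training part and a testing part (80/20 here).
cut = int(0.8 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]
print(len(train), len(test))
```

Generated data like this only covers the phrasings you wrote templates for, so it complements real annotated sentences rather than replacing them.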
gsid
1

What I am asking for is essentially unsupervised learning: input a batch of natural-language sentences and get output in the intent/entity format that Rasa or any similar tool requires.

This is absent from Rasa and similar tools because they do supervised learning. One example of a tool that might solve my problem is lang.ai.

nad
0

The idea is to provide only sample sentences. By providing the samples, you train the model to understand the sentence structure, where to expect the entities, what data types the entities are, and so on.

However, if you are just looking for named entity identification, you can use spaCy alone. Just give it a sentence and it will try to detect the entities in it. spaCy ships pretrained models for this.

Reference: spaCy Named Entities
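As a rough sketch of what that could look like: the helper names `ents_to_rasa` and `annotate` are hypothetical, and the model name `en_core_web_sm` is an assumption (any installed spaCy pipeline with an NER component should work). The converter only relies on the `start_char`, `end_char`, `text` and `label_` attributes that spaCy entity spans expose.

```python
def ents_to_rasa(text, ents):
    """Convert spaCy-style entity spans into Rasa-style entity dicts."""
    return {
        "text": text,
        "entities": [
            {
                "start": ent.start_char,
                "end": ent.end_char,
                "value": ent.text,
                "entity": ent.label_,
            }
            for ent in ents
        ],
    }


def annotate(text, model_name="en_core_web_sm"):
    """Run a pretrained spaCy pipeline and return its entities in Rasa-style form.

    Assumes spaCy and the named model are installed.
    """
    import spacy  # imported lazily so ents_to_rasa works without spaCy

    nlp = spacy.load(model_name)
    return ents_to_rasa(text, nlp(text).ents)
```

Calling `annotate("Is there any Indian restaurant in central London?")` would return whatever entities the pretrained model finds. Note that the labels come from spaCy's generic scheme (for example `GPE` or `NORP`), not domain labels such as `cuisine` or `location`, and no intent is produced, so the output still needs review before it can serve as training data.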

Karthik Sunil