
I've been working with Dialogflow for several months now and really enjoy the tool. However, it's very important to me to know whether my changes (in training) are making intent-mapping accuracy better or worse over time. I've talked to numerous people about the best way to measure this with Dialogflow and I've gotten at least 4 different answers. Can you help me out here? Here are the 4 answers I've received on how to properly train/test your agent. Please note that ALL of these answers involve generating a confusion matrix...

  1. "Traditional" - randomly split all your input data into 80/20% - use the 80% for training and the 20% for testing. Every time you train your agent (because you've collected new input data), start the process all over again - meaning randomly split all your input data (old and new) into 80/20%. In this model, a piece of training data for one agent iteration might be used as a piece of test data on another iteration - and vice-versa. I've seen variations on this option (KFold and Monte Carlo).

  2. "Golden Set" - similar to the above except that the initial 80/20% training and testing sets that you use to create your first agent continue to grow over time as you add more input data. In this model, once a piece of input data has been tagged as training data, it will NEVER be used as testing data - and once a piece of input data has been tagged as testing data, it will NEVER be used as training data. The 2 initial sets of training and testing data just continue to grow over time as new inputs are randomly split into the existing sets.

  3. "All Data is Training Data - Except for Fake Testing Data" - In this model, we use all the input data as training data. We then copy a portion of the input data (20%) and "munge it" - meaning that we inject characters or words into the 20% portion - and use that as test data.

  4. "All Data is Training Data - and Some of it is Testing Data Also" - In this model, we use all the input data as training data. We then copy a portion of the input data (20%) and use it as testing data. In other words, we are testing our agent using a portion of our (unmodified) training data. A variation on this option is to still use all your inputs as training data but sort your inputs by "popularity/usage" and take the top 20% for testing data.

If I were creating a bot from scratch, I'd simply go with option #1 above. However, I'm using an off-the-shelf product (Dialogflow), and it isn't clear to me that traditional testing is required. Golden Set seems like it will (mostly) get me to the same place as "traditional", so I don't have a big problem with it. Option #3 seems bad - creating fake testing data sounds problematic on many levels. And option #4 tests the agent on the very data it was trained on - which scares me.
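For the Golden Set option, the only extra machinery I can see is a persistent record of which bucket each utterance was assigned to, so it never switches roles later. A rough sketch of that bookkeeping - the file name, JSON format, and 80/20 ratio are just my assumptions:

```python
# Option 2 sketch: assign each new utterance to "train" or "test" exactly once
# and persist the assignment so it never changes in later iterations.
import json
import os
import random

SPLIT_FILE = "split_assignments.json"  # persistent utterance -> bucket record

def load_split():
    if os.path.exists(SPLIT_FILE):
        with open(SPLIT_FILE) as f:
            return json.load(f)
    return {}

def assign_new_utterances(new_utterances, split, test_fraction=0.2):
    # Only utterances we have never seen before get a (random) assignment.
    for text in new_utterances:
        if text not in split:
            split[text] = "test" if random.random() < test_fraction else "train"
    return split

def save_split(split):
    with open(SPLIT_FILE, "w") as f:
        json.dump(split, f, indent=2)

# Usage per retraining cycle:
#   split = assign_new_utterances(latest_batch, load_split())
#   save_split(split)
# Then train the agent only on "train" keys and build the confusion matrix
# only on "test" keys, exactly as in the option #1 sketch above.
```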

Anyway, would love to hear some thoughts on the above from the experts!

JohnH
  • good question! been using Dialogflow as well and I've been asking the same question about training... Looking forward to seeing some answers on this topic. – Liam roels May 02 '19 at 07:13

0 Answers