
I'm working on a project that involves a sequential supervised machine learning model, which I am using to extract data from unstructured text. The diversity of the data is vast.

So I'm planning to create a training set with a huge amount of data and randomly choose some test data to check the efficiency of the model. My question is whether an increase in the amount of data in the training set would increase the efficiency of the machine learning model. If not, how can I improve the model?

Also, if I test the model with sample data whose scope is beyond the training set (i.e. data which is dissimilar to the training set), how can I make the model deal with it and produce a proper result?

And if I frequently test the model, would it really learn from those tests, or would it just generate a result based on the existing training data set?

Gobi S

1 Answer


What you typically do is use a single extensive dataset and then split that dataset randomly.

For example, if you have 100,000 rows of data, you could train the model on a random 80% of them and use the remaining 20,000 rows to validate it. This is a common pattern in machine learning.
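As a minimal sketch of that split, assuming scikit-learn is available and that your labelled examples live in hypothetical `texts` and `labels` lists:

```python
from sklearn.model_selection import train_test_split

# Hypothetical labelled data; replace with your own extracted text and labels.
texts = ["example one", "example two", "example three", "example four", "example five"]
labels = ["A", "B", "A", "B", "A"]

# Randomly hold out 20% of the rows for validation; train on the other 80%.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

print(len(X_train), "training examples,", len(X_test), "held-out examples")
```

Fixing `random_state` keeps the split reproducible, so score changes reflect changes to the model rather than a different random shuffle.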

With this split in place, you can iterate on your model and check whether its score on the held-out data improves.

You do NOT want to fabricate artificial test data for your model.

Pedro G. Dias