I'm working on a project that uses a sequential supervised machine learning model to extract data from unstructured text. The data is highly diverse.
So I'm planning to create a training set with a large amount of data, and to randomly hold out some of it as test data to check the model's performance. My question is whether increasing the amount of data in the training set would improve the model's performance. If not, how can I improve the model?
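To make the setup concrete, here is a minimal sketch of the random split I have in mind. The data, column names, and the use of scikit-learn's `train_test_split` are placeholders, not my actual pipeline:

```python
from sklearn.model_selection import train_test_split

# Placeholder data: in my project, `texts` are unstructured documents and
# `labels` are the annotations the model should learn to extract.
texts = [f"document {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# Randomly hold out 20% of the data as a test set; the rest is for training.
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

print(len(X_train), "training examples,", len(X_test), "test examples")
```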
Also, if I test the model on sample data whose scope is beyond the training set (i.e. data dissimilar to anything it was trained on), how can I make the model handle it and still produce a proper result?
And if I test the model frequently, would it actually learn from those tests, or would it just generate results based on the existing training set?