I trained a custom entity recognizer with AWS Comprehend for an entity extraction problem. The recognizer uses a default train/test data split, which in this case puts more documents in the test set than in the training set, and this affects the recognizer's metrics. These counts (the numbers of train and test documents) are also higher than the total number of inputs in the "train.csv" file I uploaded to the S3 bucket for training.

Total number of inputs in the CSV file: 1010
Train documents used by the recognizer: 2480
Test documents used by the recognizer: 3270
Can we specify the training and test document split percentage for an AWS Comprehend Custom Entity Recognizer?
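For context, here is a minimal sketch of how such a training job is started with boto3; the recognizer name, role ARN, bucket paths, and entity type below are placeholders, not values from the question:

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# Start a custom entity recognizer training job.
# Note: create_entity_recognizer exposes no parameter for the
# train/test split percentage; Comprehend chooses the split itself.
response = comprehend.create_entity_recognizer(
    RecognizerName="my-entity-recognizer",  # placeholder name
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendRole",  # placeholder ARN
    LanguageCode="en",
    InputDataConfig={
        "EntityTypes": [{"Type": "MY_ENTITY"}],  # placeholder entity type
        "Documents": {
            "S3Uri": "s3://my-bucket/train.csv",  # placeholder bucket/key
            # If InputFormat is omitted, it defaults to ONE_DOC_PER_LINE,
            # so every line of the file counts as a separate document.
        },
        "EntityList": {
            "S3Uri": "s3://my-bucket/entity_list.csv"  # placeholder bucket/key
        },
    },
)
print(response["EntityRecognizerArn"])
```

As far as I know, `create_entity_recognizer` has no split-percentage parameter; the closest control is the optional `TestS3Uri` field inside `Documents`, which lets you supply your own test set instead of relying on the automatic split.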
-
The question is not quite clear. – user269867 Jul 31 '19 at 00:02
-
@Navya I have the same issue. It seems Comprehend treats each line in the file as a separate training document, so your 1010 documents might contain multiple lines (2480 + 3270 = 5750). I am still puzzled by 2480 train documents versus 3270 test documents; usually it is the other way around, but it seems they have it upside down! – i.n.n.m Apr 03 '20 at 06:46
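One quick way to test the hypothesis in the comment above is to compare the line count of the training file against the document counts reported by the recognizer (the local file path is a placeholder):

```python
# Count lines in a local copy of the training file and compare with
# the train + test document counts reported by the recognizer.
with open("train.csv", encoding="utf-8") as f:  # placeholder local path
    line_count = sum(1 for _ in f)

reported_docs = 2480 + 3270  # train + test counts from the recognizer metadata
print(f"lines in train.csv:              {line_count}")
print(f"documents reported by Comprehend: {reported_docs}")
# If these match, Comprehend is using ONE_DOC_PER_LINE and splitting
# line by line rather than treating the CSV rows as whole documents.
```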