I have a large txt file with 50,000+ lines. How can I randomly split them into 70% for training, 20% for test, and 10% for dev?

Expected result: train.txt, test.txt, dev.txt

Kimi Shui

2 Answers
I found this approach simpler: stream through the file once and assign each line at random. The thresholds below are adjusted to the 70/20/10 split you asked for (the original snippet used 75/12.5/12.5).

# Allocate lines to train, test, and dev sets (approximately 70/20/10)
import random

with open('unique.txt') as fin, \
     open('train.txt', 'w') as ftrain, \
     open('test.txt', 'w') as ftest, \
     open('dev.txt', 'w') as fdev:
    for line in fin:
        r = random.random()
        if r < 0.7:          # ~70% of lines
            ftrain.write(line)
        elif r < 0.9:        # ~20% of lines
            ftest.write(line)
        else:                # ~10% of lines
            fdev.write(line)

Note that because each line is assigned independently, the proportions are only approximate; with 50,000+ lines they will be very close to 70/20/10.
MuneshSingh
0

Check out scikit-learn's train_test_split() method to split your data into subsets.

Then save each subset to its own file.
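A minimal sketch of this approach, calling train_test_split() twice to get a three-way split (the 1,000-line sample data is a placeholder for the lines of your file):

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the lines read from your txt file.
lines = [f"line {i}\n" for i in range(1000)]

# First split: 70% train, 30% remainder.
train, rest = train_test_split(lines, train_size=0.7, random_state=42)
# Second split: 1/3 of the remainder (10% overall) becomes dev,
# the other 2/3 (20% overall) becomes test.
test, dev = train_test_split(rest, test_size=1/3, random_state=42)

# Write each subset to its own file.
for name, subset in [("train.txt", train), ("test.txt", test), ("dev.txt", dev)]:
    with open(name, "w") as out:
        out.writelines(subset)
```

To read your real data, replace the placeholder list with `lines = open('file.txt').readlines()`.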

Elger
  • So it would be ``` data = 'file.txt' train, test, dev = train_test_split(data, train_size=0.8, test_size=0.1, dev_size=0.1) ``` — is that correct? – Kimi Shui Aug 25 '21 at 00:28
  • You would actually use train_test_split() twice. See https://stackoverflow.com/a/42932524/16744221 – Elger Aug 25 '21 at 00:38