I have a large txt file with 50,000+ lines. How can I randomly split them into 70% for training, 20% for test, and 10% for dev?

Expected result: train.txt, test.txt, dev.txt

Kimi Shui

2 Answers
I found this approach simpler: stream through the file once and assign each line at random. The thresholds below are adjusted to the 70/20/10 split you asked for (the original snippet used 75/12.5/12.5).

# Allocate lines to train, test, and dev sets (approximately 70/20/10)
import random

with open('unique.txt') as fin, \
     open('train.txt', 'w') as ftrain, \
     open('test.txt', 'w') as ftest, \
     open('dev.txt', 'w') as fdev:
    for line in fin:
        r = random.random()
        if r < 0.7:          # ~70% of lines
            ftrain.write(line)
        elif r < 0.9:        # ~20% of lines
            ftest.write(line)
        else:                # ~10% of lines
            fdev.write(line)

Note that because each line is assigned independently, the proportions are only approximate; with 50,000+ lines they will be very close to 70/20/10.
MuneshSingh
0

Check out scikit-learn's train_test_split() method to split your data into subsets.

Then save each subset to its own file.
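A minimal sketch of this approach, calling train_test_split() twice to get a three-way split (the 1,000-line sample data is a placeholder for the lines of your file):

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the lines read from your txt file.
lines = [f"line {i}\n" for i in range(1000)]

# First split: 70% train, 30% remainder.
train, rest = train_test_split(lines, train_size=0.7, random_state=42)
# Second split: 1/3 of the remainder (10% overall) becomes dev,
# the other 2/3 (20% overall) becomes test.
test, dev = train_test_split(rest, test_size=1/3, random_state=42)

# Write each subset to its own file.
for name, subset in [("train.txt", train), ("test.txt", test), ("dev.txt", dev)]:
    with open(name, "w") as out:
        out.writelines(subset)
```

To read your real data, replace the placeholder list with `lines = open('file.txt').readlines()`.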

Elger
  • So it would be ``` data = 'file.txt' train, test, dev = train_test_split(data, train_size=0.8, test_size=0.1, dev_size=0.1) ``` — is that correct? – Kimi Shui Aug 25 '21 at 00:28
  • You would actually use train_test_split() twice. See https://stackoverflow.com/a/42932524/16744221 – Elger Aug 25 '21 at 00:38