
I need to split a txt file into a training file and a testing file (also txt). I've run the code below on a smaller data set and it works perfectly, but when I try to load the complete data set (3 GB) the process dies with zsh: killed. Is there any way to avoid this?

Here is how the dataset looks:

WritingSkills | Lorem ipsum dolor sit amet, consectetur adipiscing elit.
CommunicationSkills | Lorem ipsum dolor sit amet, consectetur adipiscing elit.
MicrosoftExcel | Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Python | Lorem ipsum dolor sit amet, consectetur adipiscing elit.

from sklearn.model_selection import train_test_split
import numpy


with open("/Users/luisguillermo/CGC-IBM/entity_mapping/ms-lstm/ms-lstm/textfile.txt", "r") as f:

    print ('starting...')
    
    data = f.read().split('\n')
    data = numpy.array(data)  #convert array to numpy type array

print('text file in array')

x_train, x_test = train_test_split(data, test_size=0.05)

del data

print('data in arrays...')

# Remove empty entries from the lists
x_train = list(filter(None, x_train))
x_test = list(filter(None, x_test))

print('writing to training file')

with open('textfile_train.txt', 'w') as train:
    train.write("\n".join(x_train))

print('Training file Done')

print('writing to test file')

with open('textfile_test.txt', 'w') as test:
    test.write("\n".join(x_test))

print('Done')
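
One workaround I was considering is to stream the file line by line instead of reading it all into memory, writing each line directly to the train or test file at random. This is just a rough sketch I haven't tried on the full 3 GB file: it gives an approximate 95/5 split rather than the exact split that train_test_split produces, and it does not shuffle the data (the seed value is arbitrary):

import random

random.seed(42)  # arbitrary seed so the split is reproducible

with open("/Users/luisguillermo/CGC-IBM/entity_mapping/ms-lstm/ms-lstm/textfile.txt", "r") as src, \
     open('textfile_train.txt', 'w') as train, \
     open('textfile_test.txt', 'w') as test:
    for line in src:                # only one line is held in memory at a time
        if not line.strip():        # skip empty lines
            continue
        if random.random() < 0.05:  # send roughly 5% of the lines to the test file
            test.write(line)
        else:
            train.write(line)

Would something like this be a reasonable way around the memory problem?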

Also, I was looking into whether I could run this in the cloud, if someone knows a good provider for that.
