I need to split a txt file into a training file and a testing file (both txt). The code below works perfectly on a smaller dataset, but when I try to load the complete dataset (3 GB) the process dies with zsh: killed. Is there any way to avoid this?
Here is how the dataset looks:
WritingSkills | Lorem ipsum dolor sit amet, consectetur adipiscing elit.
CommunicationSkills | Lorem ipsum dolor sit amet, consectetur adipiscing elit.
MicrosoftExcel | Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Python | Lorem ipsum dolor sit amet, consectetur adipiscing elit.
from sklearn.model_selection import train_test_split
import numpy
with open("/Users/luisguillermo/CGC-IBM/entity_mapping/ms-lstm/ms-lstm/textfile.txt", "r") as f:
print ('starting...')
data = f.read().split('\n')
data = numpy.array(data) #convert array to numpy type array
print ('text file in array')
x_train ,x_test = train_test_split(data,test_size=0.05)
del data
print ('data in arrays...')
# Remove empty fields in the list
x_train = list(filter(None, x_train))
x_test = list(filter(None, x_test))
print ('writing to training file')
with open('textfile_train.txt', 'w') as train:
train.write("\n".join(i for i in x_train))
print ('Training file Done')
print ('writing to test file')
with open('textfile_test.txt', 'w') as test:
test.write("\n".join(i for i in x_test))
print ('Done')
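One direction I've been considering, though I haven't tested it on the full file, is to stream the input line by line and send each line to the train or test file at random instead of reading everything into memory. Below is a minimal sketch of that idea; the 0.05 threshold mirrors my test_size, but with this approach the test split is only approximately 5%, not exact the way train_test_split guarantees.

import random

# Sketch: stream the large file line by line instead of reading it all at once.
# Each non-empty line goes to the test file with probability 0.05, otherwise
# to the training file. The paths and the 0.05 fraction are assumptions.
with open("/Users/luisguillermo/CGC-IBM/entity_mapping/ms-lstm/ms-lstm/textfile.txt", "r") as f, \
     open("textfile_train.txt", "w") as train, \
     open("textfile_test.txt", "w") as test:
    for line in f:
        line = line.rstrip("\n")
        if not line:  # skip empty fields, as in the original filter(None, ...)
            continue
        if random.random() < 0.05:
            test.write(line + "\n")
        else:
            train.write(line + "\n")

Memory usage stays roughly constant regardless of file size with this approach, so I'd expect it to avoid the OOM kill, but I'm not sure if it's the right direction.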
I was also looking into running this in the cloud, if anyone can recommend a good provider for it.