I have a list of (possibly long) strings.
When I convert it to np.array, I quickly run out of RAM because the array seems to take much more memory than a plain list. Why does this happen, and how can I deal with it? Or am I just doing something wrong?
The code:
import random
import string
import numpy as np
from sys import getsizeof
cnt = 100
sentences = []
for i in range(0, cnt):
    word_cnt = random.randrange(30, 100)
    words = []
    for j in range(0, word_cnt):
        word_length = random.randrange(20)
        letters = [random.choice(string.ascii_letters) for x in range(0, word_length)]
        words.append(''.join(letters))
    sentences.append(' '.join(words))
list_size = sum([getsizeof(x) for x in sentences]) + getsizeof(sentences)
print(list_size)
arr = np.array(sentences)
print(getsizeof(arr))
print(arr.nbytes)
The output:
76345
454496
454400
I'm not sure whether I'm using getsizeof() correctly, but I only started measuring after I noticed the memory problems, so I'm pretty sure something is going on. :)
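In case it helps, this is the extra check I'd add to see how numpy stores the strings (I'm assuming dtype and itemsize are the relevant attributes to look at):

print(arr.dtype)                # the dtype numpy picked for the strings
print(arr.itemsize)             # bytes reserved per element
print(arr.size * arr.itemsize)  # should match arr.nbytes above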
(Bonus question)
I'm trying to run something similar to https://autokeras.com/examples/imdb/. The original example already needs about 3 GB of memory, and I want to use a bigger dataset. Is there a better way to handle this?
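The only workaround I've come up with so far is forcing dtype=object, so the array only holds references to the original Python strings (just a sketch, and I don't know whether downstream libraries accept object arrays):

arr_obj = np.array(sentences, dtype=object)  # hypothetical name; stores one reference per sentence
print(getsizeof(arr_obj))  # counts the pointer buffer and array overhead, not the string data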
I'm using Python 3.6.9 with numpy==1.17.0 on Ubuntu 18.04.