
I have a list of (possibly long) strings.

When I convert it to np.array, I run out of RAM quite quickly because the array seems to take much more memory than a plain list. Why, and how can I deal with it? Or maybe I'm just doing something wrong?

The code:

import random
import string
import numpy as np
from sys import getsizeof

cnt = 100  # number of sentences
sentences = []

for i in range(0, cnt):
    word_cnt = random.randrange(30, 100)  # 30-99 words per sentence
    words = []
    for j in range(0, word_cnt):
        word_length = random.randrange(20)  # 0-19 letters per word
        letters = [random.choice(string.ascii_letters) for x in range(0, word_length)]
        words.append(''.join(letters))
    sentences.append(' '.join(words))

# size of the list object itself plus each string it references
list_size = sum(getsizeof(x) for x in sentences) + getsizeof(sentences)
print(list_size)

arr = np.array(sentences)  # becomes a fixed-width unicode array (dtype '<U...')
print(getsizeof(arr))
print(arr.nbytes)

The output:

76345
454496
454400

I'm not sure if I'm using getsizeof() correctly, but I started investigating when I noticed memory problems, so I'm pretty sure something is going on : )
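
As far as I understand it, getsizeof() on a list only counts the list object and its pointer slots, not the strings it references, which is why I sum the per-element sizes above. A minimal sketch of that (the sample words are just illustrative):

from sys import getsizeof
import numpy as np

words = ['hi', 'x' * 500]

# getsizeof(list) counts only the list object and its pointer slots,
# not the strings it refers to -- hence the per-element sum.
print(getsizeof(words))                  # small, independent of string lengths
print(sum(getsizeof(w) for w in words))  # the string payload itself

# For a numpy array that owns its data, getsizeof includes the data
# buffer, so it is roughly arr.nbytes plus a small header.
arr = np.array(words)
print(arr.nbytes, getsizeof(arr))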

(Bonus question)

I'm trying to run something similar to https://autokeras.com/examples/imdb/. The original example requires about 3GB of memory, and I wanted to use a bigger dataset. Maybe there's some better way?

I'm using Python 3.6.9 with numpy==1.17.0 on Ubuntu 18.04.

jbet
  • What's the `arr.dtype`? It should be the length of the longest string; every element occupies that much space. For strings of widely varying lengths, that array storage format is less efficient than Python's native strings. Also, `numpy` doesn't offer much in the way of special string handling; its fast code only works with numeric dtypes. `pandas` uses `object` dtype and native Python strings (see the sketch after these comments). – hpaulj Feb 18 '20 at 21:00
  • `arr.dtype` is `<U1126`. – jbet Feb 18 '20 at 23:11
  • 1126 char * 4 bytes/char * 100 elements = 450400 bytes. So the `nbytes` and `getsizeof` make sense. – hpaulj Feb 18 '20 at 23:16
  • OK, I get it now, but there's one more thing. When I run another version of this script, with **all sentences and words of the same length**, the result is `215712 839696 839600`, so numpy still requires about 4x more memory. Why? – jbet Feb 19 '20 at 10:25
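
To make the point from the comments concrete, here's a minimal sketch of the difference between the default fixed-width unicode dtype and `dtype=object` (the sample strings and exact byte counts are only illustrative and assume a 64-bit build):

import numpy as np
from sys import getsizeof

samples = ['a', 'bb', 'x' * 1000]  # lengths vary widely

# Default: fixed-width unicode dtype. Every element reserves room for
# the longest string, at 4 bytes per character (UCS-4).
fixed = np.array(samples)
print(fixed.dtype)   # <U1000
print(fixed.nbytes)  # 3 * 1000 * 4 = 12000 bytes

# dtype=object: the array stores only pointers to the existing Python
# str objects, so memory stays close to that of the plain list.
ragged = np.array(samples, dtype=object)
print(ragged.nbytes)                       # 3 * 8 = 24 bytes of pointers
print(sum(getsizeof(s) for s in samples))  # plus the strings themselves

If whatever consumes the array accepts `dtype=object` (or a pandas Series), that avoids the blow-up for strings of widely varying lengths.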

0 Answers