Maybe there's a way around this that I'm missing. Long story short: I need read-only shared-memory access to a large text file, and working with strings is of course necessary. So I'm trying to do this:

import numpy as np
from multiprocessing import RawArray

if __name__ == '__main__':
    with open('test.txt', 'r') as fin:
        raw = fin.readlines()
    X_shape = (len(raw), 70)  # 70 characters per line should be sufficient for my needs
    X = RawArray('c', X_shape[0] * X_shape[1])
    X_np = np.frombuffer(X).reshape(X_shape)
    np.copyto(X_np, raw)

This doesn't work; it fails on the second-to-last line with this output:

ValueError: cannot reshape array of size 102242175 into shape (11684820,70)

For reference, the sample file is 11684820 lines long, and 11684820 * 70 is definitely not the number of elements the array claims to contain.

Clearly I must be doing something wrong, but this is the only method I see as feasible to multiprocess some CPU-bound computations on text files that are several hundred megabytes on the low end and around 6 gigabytes on the high end.

Is there a workaround, or perhaps a more correct way of doing this, so I can have a large array of strings in shared memory that I can work on with Python code? Thanks.

Will

1 Answer


numpy.frombuffer needs an explicit dtype, or it will default to dtype=float. Also, an 11684820x70 array of uint8s or 1-character bytestrings isn't the same as a length-11684820 array of 70-character bytestrings, so keep that in mind.
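To see the default-dtype pitfall concretely, here's a small-scale sketch (100 lines x 70 chars instead of the full file):

```python
import numpy as np
from multiprocessing import RawArray

buf = RawArray('c', 100 * 70)  # 7000 bytes of shared memory

# Without an explicit dtype, frombuffer treats the buffer as float64,
# so you get 7000 / 8 = 875 elements -- not 7000:
as_float = np.frombuffer(buf)
print(as_float.shape)   # (875,)

# With dtype=np.uint8, each byte is one element:
as_bytes = np.frombuffer(buf, dtype=np.uint8)
print(as_bytes.shape)   # (7000,)
```

That 8x mismatch is exactly where the 102242175 in the error message comes from: 11684820 * 70 bytes read as float64 gives 11684820 * 70 / 8 = 102242175 elements.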

For an 11684820x70 array, the shape you asked for, but probably not what you need:

X_np = np.frombuffer(X, dtype=np.uint8).reshape(X_shape)

For a length-11684820 array of dtype S70 (null-terminated bytestrings of max length 70, described as "not recommended" in the NumPy docs):

X_np = np.frombuffer(X, dtype='S70')
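A scaled-down sketch of the S70 version (100 lines instead of 11684820; note there is no reshape, because the array is 1-D with one element per line):

```python
import numpy as np
from multiprocessing import RawArray

n_lines, width = 100, 70
buf = RawArray('c', n_lines * width)    # 1 byte per character for 'S70'
X_np = np.frombuffer(buf, dtype='S70')  # shape (100,): one bytestring per line

X_np[0] = b'hello'   # stored null-padded to 70 bytes, read back without padding
print(X_np.shape)    # (100,)
print(X_np[0])       # b'hello'
```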

For a length-11684820 array of dtype U70 (null-terminated Unicode strings of max length 70), you'll need a bigger buffer (4 bytes per character), and then

X_np = np.frombuffer(X, dtype='U70')
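And a matching sketch for the U70 version, again scaled down to 100 lines. The factor of 4 is because NumPy's Unicode dtype stores 4 bytes per character:

```python
import numpy as np
from multiprocessing import RawArray

n_lines, width = 100, 70
buf = RawArray('c', n_lines * width * 4)  # 4 bytes per character for 'U70'
X_np = np.frombuffer(buf, dtype='U70')
print(X_np.shape)    # (100,)
```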
user2357112
  • That's a good point, thank you. How can I set up the shared ctypes array to be the proper dimensions in that case? I haven't worked with C in a very, very long time. – Will Aug 02 '18 at 18:21
  • Also, that works, but when I get to the next line I get `TypeError: Cannot cast scalar from dtype('S63') to dtype('uint8') according to the rule 'same_kind'`, which is frustrating. I assume S63 is `str`, since that's what the imported text is, and even if I cast it directly into a numpy array, I get this same error. Turning strings into character arrays seems oddly complicated. Not sure what to do at this point. – Will Aug 02 '18 at 18:39
  • 1
    @Will: Answer expanded. – user2357112 Aug 02 '18 at 18:45
  • Ah, I think I understand. So the only question I have at this point is: should I switch my raw data from the filehandle into an array and force the type to be a null-terminated Unicode string, so I can more easily cast it into the buffer? I feel like it would be far easier if my lines were more uniform and of the same length, but that's impossible with the stuff I'm stuck working with. So, `np.copyto(X_np, np.array(raw, dtype=np.uint8))` would perhaps be the best method? – Will Aug 02 '18 at 18:56
  • The S70 version for `np.frombuffer(X, dtype='S70').reshape(X_shape)` throws `ValueError: cannot reshape array of size 11684820 into shape (11684820,70)`, and the U70 version throws `ValueError: cannot reshape array of size 2921205 into shape (11684820,70)`. So this isn't getting what we hoped for. Unless I'm missing something else? – Will Aug 07 '18 at 15:15
  • 1
    @Will: You're reshaping. You shouldn't. (That's why my example line for that version doesn't have a `reshape`.) As for the U70, I did say you need a bigger buffer. – user2357112 Aug 07 '18 at 15:22
  • ah right, it works without that, though when I get to the copy step, I can say np.copyto(X_np, ctext), where ctext is my full buffered file of data, and if I'm using the unicode option (which should be better), it's still incorrectly sized, and I get ValueError: could not broadcast input array from shape (11684820) into shape (2921205). Unicode should be better both for the way the string is parsed, and for any other processing, such as regex, etc., that I may want to do, so I assume I should be using U70 and not S70. I'm just not sure how much bigger the buffer needs to be in the general case – Will Aug 07 '18 at 15:37
  • I see it now, 4 times the size, I'm sorry I missed that. – Will Aug 07 '18 at 15:40
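
Putting the pieces from this thread together, here is a minimal end-to-end sketch of the U70 approach. It uses an in-memory list of lines as a stand-in for `fin.readlines()`, and relies on `dtype='U70'` capping each string at 70 characters:

```python
import numpy as np
from multiprocessing import RawArray

lines = ['first line\n', 'a somewhat longer second line\n', 'third\n']
n_lines, width = len(lines), 70

# 'U70' needs 4 bytes per character, hence the factor of 4
X = RawArray('c', n_lines * width * 4)
X_np = np.frombuffer(X, dtype='U70')   # no reshape: one element per line

# Building the source array with dtype='U70' truncates anything over 70
# characters; copyto then succeeds because the dtypes match exactly
np.copyto(X_np, np.array([s.rstrip('\n') for s in lines], dtype='U70'))

print(X_np[1])   # 'a somewhat longer second line'
```

Worker processes can then call `np.frombuffer(X, dtype='U70')` on the same RawArray to get a read view of the shared data without copying it.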