Maybe there's a way around this that I'm missing. Long story short, I need shared, read-only memory access to a large text file across multiple processes, and since the data is text, working with strings is unavoidable. So I'm trying to do this:
import numpy as np
from multiprocessing import Pool, RawArray

if __name__ == '__main__':
    with open('test.txt', 'r') as fin:
        raw = fin.readlines()
    X_shape = (len(raw), 70)  # 70 characters per line should be sufficient for my needs
    X = RawArray('c', X_shape[0] * X_shape[1])  # one shared byte per character
    X_np = np.frombuffer(X).reshape(X_shape)
    np.copyto(X_np, raw)
This doesn't work; it fails on the second-to-last line (the reshape) with this output:
ValueError: cannot reshape array of size 102242175 into shape (11684820,70)
For reference, the sample file is 11684820 lines long, and 11684820 * 70 is definitely not 102242175, the number of elements the error claims the array holds.
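Doing the arithmetic: 11684820 * 70 = 817937400, and 817937400 / 102242175 is exactly 8, so it looks as though each character I allocated is somehow being counted as an 8-byte element, though I don't understand why.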
Clearly I must be doing something wrong, but this is the only method I see as feasible for multiprocessing some CPU-bound computations over text files that range from several hundred megabytes on the low end to around 6 gigabytes on the high end.
Is there a workaround, or perhaps a more correct way of doing this, so I can have a large array of strings in shared memory to work on from Python code? Thanks.
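For context, this is the overall pattern I'm trying to end up with, sketched here with a toy payload. The names (ROW_WIDTH, init_worker, process_row) are just illustrative, and the real CPU-bound computation would replace the len() call:

from multiprocessing import Pool, RawArray

ROW_WIDTH = 70   # fixed width per line, padded with NUL bytes
shared = {}

def init_worker(buf):
    # Runs once in each worker; keeps a reference to the shared buffer
    # so tasks can read it without copying the whole file around.
    shared['buf'] = buf

def process_row(i):
    # Slicing a ctypes char array returns bytes; strip the NUL padding.
    start = i * ROW_WIDTH
    row = shared['buf'][start:start + ROW_WIDTH].rstrip(b'\x00')
    return len(row)  # stand-in for the real computation

if __name__ == '__main__':
    lines = [b'hello', b'world']  # toy stand-in for the real file contents
    buf = RawArray('c', len(lines) * ROW_WIDTH)  # zero-initialized shared bytes
    for i, ln in enumerate(lines):
        data = ln[:ROW_WIDTH]  # truncate anything longer than the fixed width
        buf[i * ROW_WIDTH:i * ROW_WIDTH + len(data)] = data
    with Pool(2, initializer=init_worker, initargs=(buf,)) as pool:
        print(pool.map(process_row, range(len(lines))))  # -> [5, 5]

This slices the RawArray directly instead of wrapping it in numpy, which works but feels clumsy, hence the frombuffer/reshape attempt above.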