I'm new to Python and want the most pythonic way of solving the following basic problem:

I have many plain-text data files file.00001, file.00002, ..., file.99999 and each file has a single line, with numeric data stored in e.g. four columns. I want to read each file sequentially and append the data into one array per column, so in the end I want the arrays arr0, arr1, arr2, arr3 each with shape=(99999,) containing all the data from the appropriate column in all the files.

Later on I want to do lots of math with these arrays so I need to make sure that their entries are contiguous in memory. My naive solution is:

import numpy as np
fnumber = 99999
fnums = np.arange(1, fnumber+1)

arr0 = np.full_like(fnums, np.nan, dtype=np.double)
arr1 = np.full_like(fnums, np.nan, dtype=np.double)
arr2 = np.full_like(fnums, np.nan, dtype=np.double)
arr3 = np.full_like(fnums, np.nan, dtype=np.double)
# ...also is there a neat way of doing this??

for fnum in fnums:
    fname = f'path/to/data/folder/file.{fnum:05}'
    arr0[fnum-1], arr1[fnum-1], arr2[fnum-1], arr3[fnum-1] = np.loadtxt(fname, delimiter=' ', unpack=True)

# error checking - in case a file got deleted or something
all_arrs = (arr0, arr1, arr2, arr3)
if np.isnan(all_arrs).any():
    print("CUIDADO HAY NANS!!!!\nLOOK OUT, THERE ARE NANS!!!!")

It strikes me that this is very C-style thinking and there is probably a more pythonic way of doing it. But my feeling is that methods like numpy.concatenate and numpy.insert would either not leave the arrays' contents contiguous in memory, or involve a deep copy of the whole array at every step of the for loop, which would probably melt my laptop.
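
To make the copying concern concrete, here's a toy sketch (made-up data, not my real files): np.concatenate does return a contiguous array, but it re-allocates and copies everything accumulated so far on every call.

import numpy as np

# Growing an array with np.concatenate copies the whole accumulated
# array on every iteration (O(n^2) copying overall), even though each
# intermediate result is itself contiguous.
arr = np.empty(0, dtype=np.double)
for i in range(5):
    arr = np.concatenate([arr, np.array([float(i)])])
print(arr.flags['C_CONTIGUOUS'])  # True, but at the cost of repeated copies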

Is there a more pythonic way?

jms547
  • `all_arrs` is a tuple of arrays. `np.isnan` will first turn that into one array (i.e. `np.array(all_arrs)`), returning a boolean array of shape (4, 99999). – hpaulj Nov 04 '20 at 22:16
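
A quick toy sketch of what that comment describes (illustrative names, not from the question):

import numpy as np

a = np.array([1.0, np.nan])
b = np.array([3.0, 4.0])
mask = np.isnan((a, b))        # the tuple is first coerced, as if by np.array((a, b))
print(mask.shape, mask.any())  # (2, 2) True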

1 Answer

Try:

alist = []
for fnum in fnums:
    fname = f'path/to/data/folder/file.{fnum:05}'
    alist.append(np.loadtxt(fname))  # appending just stores a reference; cheap
arr = np.array(alist)                # one copy at the end, into a contiguous array
# arr = np.vstack(alist)             # alternative
print(arr.shape)

Assuming the files all have the same number of columns, one of these should work. The result will be a single (99999, 4) array, which you can split into four column arrays if needed.
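
For instance, a minimal sketch of that separation, reusing the question's arr0 to arr3 names (the .copy() is what makes each 1-D result contiguous, since a bare column slice of a C-ordered array is only a strided view):

arr0, arr1, arr2, arr3 = (arr[:, j].copy() for j in range(4))
print(arr[:, 0].flags['C_CONTIGUOUS'])  # False: a column slice is a strided view
print(arr0.flags['C_CONTIGUOUS'])       # True: .copy() packs it into its own buffer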

hpaulj
  • How does this work? Inside the for loop you append to `alist`, and then outside you recast(??) `alist` as an ndarray, which takes care of the efficient memory allocation? Seems OK, except that since numpy arrays are row-major, aren't contiguous elements of a column 4 memory addresses apart rather than right next to each other? (Also, your method stores the same data twice, once in `alist` and once in `arr`, but I guess that's not a problem with only a few thousand numbers.) – jms547 Nov 04 '20 at 22:07
  • `loadtxt` loads the file line by line and builds the result from the resulting list of lists. It really doesn't matter whether you collect a list of arrays and make an array from those, or assign those arrays to slices of a predefined array. Memory use and copying are basically the same. Remember, a list contains references/pointers to objects elsewhere in memory. – hpaulj Nov 04 '20 at 22:13
  • Appending arrays to a list is quite efficient, since it just adds a reference to the list. Doing `concatenate` iteratively is slow, but doing one copy at the end, joining all the arrays into one, is much better. But feel free to compare alternatives (on smaller problems if needed). – hpaulj Nov 04 '20 at 22:18
  • My feeling is that this solution is definitely more pythonic, but it involves passing the same data around a few extra times. And if I create my `arr0` to `arr3` by slicing your `arr`, that will leave them as arrays that aren't contiguous in memory. If I `copy` them it should be OK. – jms547 Nov 05 '20 at 13:38
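
As a footnote to that last comment, a small sketch (toy column index, same arr as in the answer): np.ascontiguousarray is a convenient alternative to .copy(), since it copies only when the input isn't already contiguous.

col = arr[:, 2]                      # strided view into the 2-D array
col_c = np.ascontiguousarray(col)    # copies here, because col is not contiguous
print(col_c.flags['C_CONTIGUOUS'])   # True
print(np.shares_memory(col, col_c))  # False: a fresh, contiguous buffer was made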