
I may be missing something obvious here, but I cannot find a function numpy.map. It would behave like Python's map function, but collect the output in a numpy array. For example, I could have an image generator genImage(i) that generates a 2D image (of size (m, n)) based on a single input, and I would like to feed range(k) to my generator function and obtain a (k, m, n) array.

Currently, I would use numpy.array(list(map(genImage, range(k)))), but I feel that this conversion into a list is rather inefficient (my final array is about 50 GB in size). I am thus looking for numpy.map(genImage, range(k)), which is similar to numpy.fromiter, but for multidimensional outputs of the iterator.

(I have tried np.array(map(...)), but that returns a one-element array with the map as its only entry - here is why: Why is it required to typecast a map into a list to assign it to a pandas series?)

Is there a better way to achieve what I want? I am looking for a way that, ideally, I could use with joblib.

bers
  • `np.frompyfunc` sort of does this. The output is an object dtype array. In my past timings it can be up to 2x faster than the equivalent `np.array([list comprehension])`. – hpaulj Nov 02 '18 at 11:53
  • In `np.array(list(map(func...,)))` the `list` just runs the `map`, collecting values in a list. The `array` then joins those values into an array. The main time consumer is calling your function many times, not the collection mechanism. This topic comes up often, often described as vectorizing or avoiding loops. – hpaulj Nov 02 '18 at 12:01
  • @hpaulj are you saying that copying some 50 GB of memory around from the list of arrays into one compound array is not a significant source of runtime I should worry about? – bers Nov 02 '18 at 14:39
  • Another common big array construction idea is to create a `zeros(k,m,n)` array, and assign the output of your function one by one `result[i,:,:] = func(i)`. – hpaulj Nov 02 '18 at 17:00
  • @hpaulj "one by one" sounds like you propose I use a loop. I feel that is not ideal performance-wise - compare https://stackoverflow.com/q/2106287/ – bers Nov 03 '18 at 17:57
  • I was suggesting something like unutbu's `numpy_all_the_way` which has a slight time advantage over the `array(list...)` alternative. He was using Py2, so didn't need to wrap `map` in `list`. – hpaulj Nov 03 '18 at 18:21
  • You could perhaps avoid copying if you can arrange for `genImage` to write its result directly into a preallocated `(k, m, n)` shaped array. This would avoid the generation of a temporary intermediary list. Among NumPy-based solutions which require `k` calls to `genImage` doing the above is *probably* the fastest possible. Sometimes, to do better than this involves vectorizing away the `for-loop` with the `k` calls to `genImage`. Sometimes it is possible to do this with NumPy, or perhaps with Cython or Numba. But to be more specific we would need to see the source code for `genImage`. – unutbu Nov 05 '18 at 22:26
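The `np.frompyfunc` route from the comments above can be sketched as follows; `gen_image` is a hypothetical stand-in for the question's `genImage`:

```python
import numpy as np

def gen_image(i):
    # hypothetical stand-in for genImage: a (2, 3) image filled with the value i
    return np.full((2, 3), i)

# Wrap the Python function as a ufunc with 1 input and 1 output.
# Applied to an array, it returns a 1-D object-dtype array whose
# elements are the individual (2, 3) images.
f = np.frompyfunc(gen_image, 1, 1)
images = f(np.arange(4))

# np.stack assembles those images along a new leading axis -> (4, 2, 3)
result = np.stack(images)
```

Note this still materializes one array per call before stacking, so it mainly saves the explicit `list(map(...))` step rather than the final copy.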
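The preallocation idea suggested in the comments can be sketched like this (again with a hypothetical `gen_image` in place of `genImage`): allocate the full `(k, m, n)` array once and write each image into it, so no intermediate list of arrays is built and no extra copy of the 50 GB result is made.

```python
import numpy as np

def gen_image(i):
    # hypothetical stand-in for genImage: a (2, 3) image filled with the value i
    return np.full((2, 3), float(i))

k, m, n = 4, 2, 3
result = np.empty((k, m, n))  # allocate the full output once
for i in range(k):
    result[i] = gen_image(i)  # each image is written in place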

1 Answer


If I understood you correctly, you need column_stack, which works like this:

import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
a = np.column_stack((range(3), a))

print(a)
# [[0 1 2]
#  [1 3 4]
#  [2 5 6]]
zipa
  • `column_stack` is just a way of joining arrays in a list. Look at its code. – hpaulj Nov 02 '18 at 12:17
  • But look at its code. It does a list comprehension on the inputs, adding a dimension to each, then uses concatenate. It's a nice function, but doesn't get around the `list(map..)` step. – hpaulj Nov 02 '18 at 14:34
  • @hpaulj you mean the `asanyarray()` step? Yes, it seems you are right! – bers Nov 03 '18 at 18:00