
I have a two-dimensional numpy.ndarray of floats. Each row is to be converted to a string consisting of 1s and 0s, reflecting whether the elements of the row satisfy a given property or not.
In this question, I will show my approach (which works), then explain why I find it unsatisfactory, and then ask for your advice.

My approach so far:

import numpy as np

threshold = 0.1

## This array serves as an example. In my actual code, it
## is bigger: something like shape=(30000, 5).
## That is, 30000 rows and 5 columns. Both numbers will
## vary from case to case.
test_array  = np.array(
                [[0.5,0.2,0.0,0.0,0.3],
                 [0.8,0.0,0.0,0.0,0.2],
                 [0.8,0.0,0.1,0.0,0.1],
                 [1.0,0.0,0.0,0.0,0.0],
                 [0.9,0.0,0.0,0.1,0.0],
                 [0.1,0.0,0.0,0.8,0.1],
                 [0.0,0.1,0.0,0.0,0.9],
                 [0.0,0.0,0.0,0.0,1.0],
                 [0.0,0.0,0.5,0.5,0.0],
                ],
                dtype=float
            )

## Now comes the conversion in two steps.
test_array_2 = np.where(test_array > threshold, '1', '0')
test_array_3 = np.apply_along_axis(''.join, 1, test_array_2)

The intermediate result test_array_2 evaluates to

array([['1', '1', '0', '0', '1'],
       ['1', '0', '0', '0', '1'],
       ['1', '0', '0', '0', '0'],
       ['1', '0', '0', '0', '0'],
       ['1', '0', '0', '0', '0'],
       ['0', '0', '0', '1', '0'],
       ['0', '0', '0', '0', '1'],
       ['0', '0', '0', '0', '1'],
       ['0', '0', '1', '1', '0']], dtype='<U1')

and test_array_3 evaluates to

array(['11001', '10001', '10000', '10000', '10000', '00010', '00001',
       '00001', '00110'], dtype='<U5')

test_array_3 is the result I want.

Why this is unsatisfactory: I dislike my use of the str.join() method. Maybe it is because I'm inexperienced, but it feels like it makes the code less readable.
Also (maybe the more important point), the function np.apply_along_axis is not efficient. It would be better to vectorize the computation, right?

Question: Is the use of str.join() a bad choice and, if so, what other methods are there?
Is there a good way to vectorize the computation?
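For illustration, here is one possible fully vectorized sketch (not part of the original post; it assumes the intermediate '<U1' array is C-contiguous, which it is when it comes fresh from np.where): reinterpret each row of single characters as one fixed-width string via ndarray.view, so no Python-level loop is needed.

```python
import numpy as np

threshold = 0.1
test_array = np.array([[0.5, 0.2, 0.0, 0.0, 0.3],
                       [0.0, 0.0, 0.5, 0.5, 0.0]], dtype=float)

## Step 1 as before: element-wise '1'/'0' characters, dtype '<U1'.
chars = np.where(test_array > threshold, '1', '0')

## Reinterpret each contiguous row of `width` single characters as one
## string of length `width`; `view` only changes the dtype, not the data.
width = test_array.shape[1]
strings = chars.view(f'U{width}').ravel()
## strings is now array(['11001', '00110'], dtype='<U5')
```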

NerdOnTour
  • You say: `In my actual code, it` `is bigger.` What are the actual dimensions? – Armali Nov 20 '21 at 19:51
  • @Armali It will be roughly 30,000 rows and about 5 columns. So the strings I want to get will have a length of about 5, as indicated in my question. The number of strings will be much larger. – NerdOnTour Nov 22 '21 at 08:05

1 Answer


I don't know whether you find this satisfactory or readable, but at least it uses neither str.join nor np.apply_along_axis:

width = test_array.shape[1]
## Pack each row of booleans into one byte (most significant bit first),
## then shift out the 8 - width padding bits; this requires width <= 8.
packed = np.packbits(test_array > threshold, 1) >> (8 - width)
## Vectorize np.binary_repr to turn each packed value into a bit string.
result = np.frompyfunc(np.binary_repr, 2, 1)(packed.flatten(), width)
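Applied to a small sample, the packbits/right-shift/binary_repr pipeline behaves as follows (a self-contained illustration, not part of the original answer):

```python
import numpy as np

threshold = 0.1
sample = np.array([[0.5, 0.2, 0.0, 0.0, 0.3],
                   [0.0, 0.0, 0.5, 0.5, 0.0]], dtype=float)

width = sample.shape[1]                       # 5 columns, must be <= 8
## packbits fills each row into one byte, most significant bit first:
## [T, T, F, F, T] -> 0b11001000 = 200.
packed = np.packbits(sample > threshold, axis=1)
## Shift out the 8 - width padding bits: 200 >> 3 == 25 == 0b11001.
codes = packed >> (8 - width)
## binary_repr(25, 5) == '11001'; frompyfunc applies it element-wise.
strings = np.frompyfunc(np.binary_repr, 2, 1)(codes.ravel(), width)
```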
Armali
  • Thanks for the idea of using `np.packbits`. Where can I find documentation on the `>>`? I didn't find any so far. – NerdOnTour Nov 22 '21 at 13:01
  • The documentation of [`numpy.right_shift`](https://numpy.org/doc/stable/reference/generated/numpy.right_shift.html) says: _The `>>` operator can be used as a shorthand for `np.right_shift` on ndarrays._ – Armali Nov 22 '21 at 16:50
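A minimal sketch confirming that equivalence (the byte values here are made-up examples):

```python
import numpy as np

packed = np.array([200, 48], dtype=np.uint8)  # example packed bytes
## The operator form and the function call produce identical results.
assert np.array_equal(packed >> 3, np.right_shift(packed, 3))
shifted = packed >> 3   # array([25, 6], dtype=uint8)
```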