I have a two-dimensional numpy.ndarray of floats. Each row is to be converted to a string consisting of 1s and 0s, reflecting whether the elements of the row satisfy a given property or not.
In this question, I will show my approach (which works), then explain why I find it unsatisfactory, and then ask for your advice.
My approach so far:
import numpy as np
threshold = 0.1
## This array serves as an example. In my actual code, it
## is bigger: something like shape=(30000, 5), i.e. 30000 rows
## and 5 columns. Both numbers will vary from case to case.
test_array = np.array(
    [[0.5, 0.2, 0.0, 0.0, 0.3],
     [0.8, 0.0, 0.0, 0.0, 0.2],
     [0.8, 0.0, 0.1, 0.0, 0.1],
     [1.0, 0.0, 0.0, 0.0, 0.0],
     [0.9, 0.0, 0.0, 0.1, 0.0],
     [0.1, 0.0, 0.0, 0.8, 0.1],
     [0.0, 0.1, 0.0, 0.0, 0.9],
     [0.0, 0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 0.5, 0.5, 0.0]],
    dtype=float,
)
## Now comes the conversion, in two steps.
## Step 1: element-wise comparison with the threshold, mapped to the
## characters '1' and '0'.
test_array_2 = np.where(test_array > threshold, '1', '0')
## Step 2: join the characters of each row into a single string.
test_array_3 = np.apply_along_axis(''.join, 1, test_array_2)
The intermediate result test_array_2 evaluates to
array([['1', '1', '0', '0', '1'],
['1', '0', '0', '0', '1'],
['1', '0', '0', '0', '0'],
['1', '0', '0', '0', '0'],
['1', '0', '0', '0', '0'],
['0', '0', '0', '1', '0'],
['0', '0', '0', '0', '1'],
['0', '0', '0', '0', '1'],
['0', '0', '1', '1', '0']], dtype='<U1')
and test_array_3 evaluates to
array(['11001', '10001', '10000', '10000', '10000', '00010', '00001',
'00001', '00110'], dtype='<U5')
test_array_3 is the result I want.
Why this is unsatisfactory:
I dislike my use of the str.join() method. Maybe it is because I'm inexperienced, but it feels like it makes the code less readable.
Also (and maybe more importantly), the function np.apply_along_axis is not efficient; it would be better to vectorize the computation, right?
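For illustration, here is a minimal sketch of one conceivable vectorized direction: reinterpreting the array of single characters through a wider string dtype via ndarray.view. The names chars, joined and n_cols below are only illustrative, and the sketch assumes the intermediate array is C-contiguous (which it is here, since np.where returns a fresh array); I do not know whether this is safe or idiomatic in general.
import numpy as np

threshold = 0.1
test_array = np.array(
    [[0.5, 0.2, 0.0, 0.0, 0.3],
     [0.8, 0.0, 0.0, 0.0, 0.2],
     [0.0, 0.0, 0.5, 0.5, 0.0]],
    dtype=float,
)
## Step 1 as before: a (n_rows, n_cols) array of the characters '1'/'0'.
chars = np.where(test_array > threshold, '1', '0')
## Step 2: reinterpret each row of n_cols one-character strings as a
## single string of length n_cols. This only works because each row's
## character data is laid out contiguously in memory.
n_cols = chars.shape[1]
joined = chars.view(f'U{n_cols}').ravel()
print(joined)  ## ['11001' '10001' '00110']
Whether something along these lines is reasonable is part of what I am asking below.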
Questions:
Is the use of str.join() a bad choice and, if so, what other methods are there?
Is there a good way to vectorize the computation?