4

I have a text file made as:

0.01 1 0.1 1 10 100 a
0.02 3 0.2 2 20 200 b
0.03 2 0.3 3 30 300 c
0.04 1 0.4 4 40 400 d

I read it as a list A and then converted to a numpy array, that is:

>>> A
array([['0.01', '1', '0.1', '1', '10', '100', 'a'],
       ['0.02', '3', '0.2', '2', '20', '200', 'b'],
       ['0.03', '2', '0.3', '3', '30', '300', 'c'],
       ['0.04', '1', '0.4', '4', '40', '400', 'd']], 
      dtype='|S4')

I just want to extract a sub-array B, made of the rows of A whose entry in column 4 (zero-based, i.e. the fifth value) is lower than 30. It should look something like:

B = array([['0.01', '1', '0.1', '1', '10', '100', 'a'],
           ['0.02', '3', '0.2', '2', '20', '200', 'b']])

When dealing with arrays, I usually do simply B = A[A[:,4]<30], but in this case (maybe due to the presence of characters/strings I've never worked with) it doesn't work, giving me this:

>>> A[A[:,4]<30]
array(['0.01', '1', '0.1', '1', '10', '100', 'a'], 
      dtype='|S4')

and I can't figure out why. This isn't my own code, and I don't think I can switch everything to structures or dictionaries. Any suggestions for doing this with numpy arrays? Thank you very much in advance!

urgeo

2 Answers

3

You have to compare int to int

A[A[:,4].astype(int)<30]

or str to str

A[A[:,4]<'30'] 

Note, however, that the latter works in your specific example but not in general, because it compares strings lexicographically (for example, '110' < '30' returns True, but 110 < 30 returns False).
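As a minimal sketch of the two comparisons above, the following rebuilds the array from the question and adds a small made-up column (`col`, not from the original data) to show where the string comparison goes wrong:

```python
import numpy as np

# Same data as in the question; every element is stored as a string.
A = np.array([['0.01', '1', '0.1', '1', '10', '100', 'a'],
              ['0.02', '3', '0.2', '2', '20', '200', 'b'],
              ['0.03', '2', '0.3', '3', '30', '300', 'c'],
              ['0.04', '1', '0.4', '4', '40', '400', 'd']])

# Cast the column to int before comparing, then use the mask to pick rows.
mask = A[:, 4].astype(int) < 30
B = A[mask]          # keeps the rows ending in 'a' and 'b'

# Hypothetical column illustrating the lexicographic pitfall.
col = np.array(['10', '110', '30'])
str_mask = col < '30'               # compares character by character
int_mask = col.astype(int) < 30     # compares numeric values
print(str_mask, int_mask)
```

Here `str_mask` flags '110' as smaller than '30' while `int_mask` does not, which is why the cast is the safer choice.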


NumPy infers the element type from your data. Here it assigned dtype '|S4', meaning every element is stored as a string of at most 4 characters. This is probably a consequence of the underlying C code (which is what makes NumPy fast) requiring all elements of an array to share one fixed type.

To illustrate this difference, check the following code:

>>> np.array([['0.01', '1', '0.1', '1', '10', '100', 'a']])
array([['0.01', '1', '0.1', '1', '10', '100', 'a']], dtype='|S4')

The inferred type is a string of length 4, which is the maximum length among your elements (for example, '0.01'). If you explicitly define the array to hold general Python objects, it will do what you want:

>>> np.array([[0.01, 1, 0.1, 1, 10, 100, 'a']], dtype=object)
array([[0.01, 1, 0.1, 1, 10, 100, 'a']], dtype=object)

and your code A[A[:,4]<30] would work properly.
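A small sketch of the difference, assuming the file was parsed into a list of lists with real numbers in it (the name `rows` is only illustrative):

```python
import numpy as np

# Hypothetical parsed input: floats, ints and one string per row.
rows = [[0.01, 1, 0.1, 1, 10, 100, 'a'],
        [0.02, 3, 0.2, 2, 20, 200, 'b'],
        [0.03, 2, 0.3, 3, 30, 300, 'c'],
        [0.04, 1, 0.4, 4, 40, 400, 'd']]

# Without an explicit dtype, the mixed values are all coerced to strings.
A_str = np.array(rows)
print(A_str.dtype.kind)   # 'U' (unicode) on Python 3, 'S' on Python 2

# With dtype=object each element keeps its original Python type,
# so the numeric comparison behaves as expected.
A_obj = np.array(rows, dtype=object)
B_obj = A_obj[A_obj[:, 4] < 30]
print(B_obj)
```

Keep in mind that object arrays give up most of NumPy's speed advantage, so this is mainly useful when the array has to stay a single 2-D block.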

For more information, this is a very complete guide

rafaelc
  • But when I deal with the file, I read them as integer and float, why do they become strings when I pass to a numpy array? – urgeo Apr 29 '18 at 19:10
  • It converts to `str` because your arrays have elements with different types. `NumPy` tries to infer the type of your elements – rafaelc Apr 29 '18 at 19:15
  • Omg, I didn't notice that my array was made of strings! When I read the file I create a list of lists and I read each entry as integer, float, or string. I don't get why numpy changes them all to strings... – urgeo Apr 29 '18 at 19:17
1
In [86]: txt='''0.01 1 0.1 1 10 100 a
    ...: 0.02 3 0.2 2 20 200 b
    ...: 0.03 2 0.3 3 30 300 c
    ...: 0.04 1 0.4 4 40 400 d'''
In [87]: A = np.genfromtxt(txt.splitlines(), dtype=str)
In [88]: A
Out[88]: 
array([['0.01', '1', '0.1', '1', '10', '100', 'a'],
       ['0.02', '3', '0.2', '2', '20', '200', 'b'],
       ['0.03', '2', '0.3', '3', '30', '300', 'c'],
       ['0.04', '1', '0.4', '4', '40', '400', 'd']], dtype='<U4')
In [89]: A[:,4]
Out[89]: array(['10', '20', '30', '40'], dtype='<U4')

By default, genfromtxt tries to parse every field as a float, in which case the character column would come out as nan. Instead, I specified dtype=str.

So a numeric test would require converting the column to numbers:

In [90]: A[:,4].astype(int)
Out[90]: array([10, 20, 30, 40])
In [91]: A[:,4].astype(int)<30
Out[91]: array([ True,  True, False, False])

In this case a string comparison also works:

In [99]: A[:,4]<'30'
Out[99]: array([ True,  True, False, False])

Or if we use dtype=None, it infers dtype by column and makes a structured array:

In [93]: A1 = np.genfromtxt(txt.splitlines(), dtype=None,encoding=None)
In [94]: A1
Out[94]: 
array([(0.01, 1, 0.1, 1, 10, 100, 'a'), (0.02, 3, 0.2, 2, 20, 200, 'b'),
       (0.03, 2, 0.3, 3, 30, 300, 'c'), (0.04, 1, 0.4, 4, 40, 400, 'd')],
      dtype=[('f0', '<f8'), ('f1', '<i8'), ('f2', '<f8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8'), ('f6', '<U1')])

Now we can select a field by name, and test it:

In [95]: A1['f4']
Out[95]: array([10, 20, 30, 40])
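Putting the field test to use, a minimal sketch of the row selection on the structured array could look like this (the field names `f4` and `f6` are the defaults genfromtxt generates, as in the output above):

```python
import numpy as np

txt = '''0.01 1 0.1 1 10 100 a
0.02 3 0.2 2 20 200 b
0.03 2 0.3 3 30 300 c
0.04 1 0.4 4 40 400 d'''

# dtype=None lets genfromtxt pick a type per column (structured array).
A1 = np.genfromtxt(txt.splitlines(), dtype=None, encoding=None)

# The integer field can be compared directly; the mask selects records.
B1 = A1[A1['f4'] < 30]
print(B1['f6'])   # the label field of the selected records
```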

Either way we can select rows based on the True/False mask or the corresponding row indices:

In [96]: A[[0,1],:]
Out[96]: 
array([['0.01', '1', '0.1', '1', '10', '100', 'a'],
       ['0.02', '3', '0.2', '2', '20', '200', 'b']], dtype='<U4')

In [98]: A1[[0,1]]     # A1 is 1d
Out[98]: 
array([(0.01, 1, 0.1, 1, 10, 100, 'a'), (0.02, 3, 0.2, 2, 20, 200, 'b')],
      dtype=[('f0', '<f8'), ('f1', '<i8'), ('f2', '<f8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8'), ('f6', '<U1')])
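To tie the mask and the row indices together, here is a short sketch using `np.flatnonzero` to turn the boolean mask into indices; both forms should select the same rows:

```python
import numpy as np

txt = '''0.01 1 0.1 1 10 100 a
0.02 3 0.2 2 20 200 b
0.03 2 0.3 3 30 300 c
0.04 1 0.4 4 40 400 d'''

A = np.genfromtxt(txt.splitlines(), dtype=str)

mask = A[:, 4].astype(int) < 30   # boolean mask, one entry per row
idx = np.flatnonzero(mask)        # positions where the mask is True

by_mask = A[mask]
by_index = A[idx, :]
print(idx)
print(np.array_equal(by_mask, by_index))
```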
hpaulj