
Here is my input:

data = np.array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', np.NaN, 'c2')], dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')]).view(np.recarray)

I want this as the output:

rec.array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1')], dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])

I have tried:

data[data['B'] != np.NaN].view(np.recarray)

but it doesn't work.

data[data['A'] != 'a2'].view(np.recarray)

gives the desired output.

Why is this method not working for np.NaN? How do I remove rows containing np.NaN values in recarrays of object datatype? Also, ~np.isnan() doesn't work with object datatype.

hpaulj
geedee
  • There are 2 issues. `x == np.nan` is always False. `nan` does not equal anything, including another `np.nan`. `np.isnan` is the correct test for `nan`, but it only works on a float `dtype` array, not on object or string dtypes. You need to write a function that conditionally applies `isnan` and doesn't choke on strings, then apply that iteratively to each field. – hpaulj Jan 25 '17 at 05:51
  • Why does your array contain `np.nan` instead of `'nan'` strings? With the string value, the dtype could be `U3` (or `S3`) and you could do `data['A'] != 'nan'` tests. `np.nan` is a special float value that, in the context of strings, just gives headaches. – hpaulj Jan 28 '17 at 23:19
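The first comment's point can be checked directly: NaN compares unequal to everything, including itself, so an `!=` test never filters it out.

```python
import numpy as np

# NaN is defined to be unordered and unequal to everything, itself included
print(np.nan == np.nan)   # False
print(np.nan != np.nan)   # True - so `data['B'] != np.nan` is True for every row
print(np.isnan(np.nan))   # True - isnan is the correct test, but only for floats
```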

1 Answer


Define a function that applies np.isnan, but does not choke on a string:

def foo(item):
    try:
        return np.isnan(item)
    except TypeError:
        # strings (and other non-float objects) are not NaN
        return False

And use vectorize to make a function that will apply this to the elements of an array, and return a boolean array:

f = np.vectorize(foo, otypes=[bool])

With your data:

In [240]: data = np.array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', np.NaN, 'c2')], dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])
In [241]: data
Out[241]: 
array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', nan, 'c2')], 
      dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])
In [242]: data['B']
Out[242]: array(['b1', 'b1', nan], dtype=object)

In [243]: f(data['B'])
Out[243]: array([False, False,  True], dtype=bool)

In [244]: data[~f(data['B'])]
Out[244]: 
array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1')], 
      dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])
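Putting the pieces together, a minimal end-to-end sketch (same `foo`/`f` definitions) that produces the recarray output the question asks for:

```python
import numpy as np

def foo(item):
    # np.isnan raises TypeError on strings; treat those as "not NaN"
    try:
        return np.isnan(item)
    except TypeError:
        return False

f = np.vectorize(foo, otypes=[bool])

data = np.array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', np.nan, 'c2')],
                dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])

# keep only the rows where field 'B' is not NaN, then view as recarray
clean = data[~f(data['B'])].view(np.recarray)
print(clean)   # clean.B is now ['b1', 'b1']
```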

==============

The simplest way to apply this test, and remove rows, across all fields is to iterate on the field names:

In [429]: data    # expanded with more nan
Out[429]: 
array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', nan, 'c2'),
       ('a2', 'b1', nan), (nan, 'b1', 'c1')], 
      dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])

The f function applied to each field and collected into an array:

In [441]: np.array([f(data[name]) for name in data.dtype.names])
Out[441]: 
array([[False, False, False, False,  True],
       [False, False,  True, False, False],
       [False, False, False,  True, False]], dtype=bool)

Use any along axis 0 to flag the records where any field is True:

In [442]: np.any(_, axis=0)
Out[442]: array([False, False,  True,  True,  True], dtype=bool)
In [443]: data[_]    # the ones with nan
Out[443]: 
array([('a2', nan, 'c2'), ('a2', 'b1', nan), (nan, 'b1', 'c1')], 
      dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])
In [444]: data[~__]   # the ones without
Out[444]: 
array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1')], 
      dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])

(In Ipython _ and __ contain the results shown in the previous Out lines.)
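Outside IPython, the same all-fields filter reads more clearly with named variables; a sketch assuming the `foo`/`f` definitions above:

```python
import numpy as np

def foo(item):
    try:
        return np.isnan(item)
    except TypeError:
        return False

f = np.vectorize(foo, otypes=[bool])

data = np.array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', np.nan, 'c2'),
                 ('a2', 'b1', np.nan), (np.nan, 'b1', 'c1')],
                dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])

# one boolean row per field, stacked into an (n_fields, n_records) array
field_masks = np.array([f(data[name]) for name in data.dtype.names])
has_nan = field_masks.any(axis=0)   # True for records with NaN in any field
clean = data[~has_nan]
```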

tolist converts the array into a list of tuples (the records of a structured array are displayed as tuples):

In [448]: data.tolist()
Out[448]: 
[('a2', 'b1', 'c1'),
 ('a1', 'b1', 'c1'),
 ('a2', nan, 'c2'),
 ('a2', 'b1', nan),
 (nan, 'b1', 'c1')]

f as a vectorized function is able to apply foo to each element of the nested list; apparently it does the equivalent of `np.array(data.tolist(), dtype=object)` first:

In [449]: f(data.tolist())
Out[449]: 
array([[False, False, False],
       [False, False, False],
       [False,  True, False],
       [False, False,  True],
       [ True, False, False]], dtype=bool)
In [450]: np.any(_, axis=1)
Out[450]: array([False, False,  True,  True,  True], dtype=bool)

I've never tried this combination of tolist and vectorize before. Vectorized functions iterate over their inputs, so they don't offer much of a speed advantage over explicit iterations, but for tasks like this it sure simplifies the coding.
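Since vectorize iterates anyway, a plain comprehension does the same job; a sketch of an equivalent without `np.vectorize`:

```python
import numpy as np

def foo(item):
    try:
        return np.isnan(item)
    except TypeError:
        return False

data = np.array([('a2', 'b1', 'c1'), ('a1', 'b1', 'c1'), ('a2', np.nan, 'c2')],
                dtype=[('A', 'O'), ('B', 'O'), ('C', 'O')])

# True for records where any field is NaN, via plain Python iteration
has_nan = np.array([any(foo(v) for v in row) for row in data.tolist()])
clean = data[~has_nan]
```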

Another possibility is to define foo to operate across the fields of a record. In fact I discovered the tolist trick when I tried to apply f to a single record:

In [456]: f(data[2])
Out[456]: array(False, dtype=bool)
In [458]: f(list(data[2]))
Out[458]: array([False,  True, False], dtype=bool)
In [459]: f(data[2].tolist())
Out[459]: array([False,  True, False], dtype=bool)
hpaulj
  • How do I drop all the rows containing NaN in one go, rather than one field at a time? @hpaulj – geedee Jan 28 '17 at 10:34
  • My test works with one field at a time. We'd have to take a logical combination across fields, `any(f(A), f(B), etc)`, or within `foo` itself. – hpaulj Jan 28 '17 at 11:32
  • I've added a couple of examples of applying this test to the whole structured array. It's easier than I thought. – hpaulj Jan 28 '17 at 17:44