
I'm using numpy genfromtxt, and I need to identify both missing data and bad data. Depending on user input, I may want to drop bad values or raise an error. Essentially, I want to treat missing and bad data as the same thing.

Say I have a file like this, where the columns are of data types "date, int, float"

date,id,value
2017-12-4,0,       # BAD. missing data
2017-12-4,1,XYZ    # BAD. value should be a float, not a string
2017-12-4,2,1.0    # good
2017-12-4,3,1.0    # good
2017-12-4,4,1.0    # good

I would like to detect both. So, I do this

dtype=(np.dtype('<M8[D]'), np.dtype('int64'), np.dtype('float64'))
result = np.genfromtxt(filename, delimiter=',', dtype=dtype, names=True, usemask=True, usecols=('date', 'id', 'value'))

And the result is this

masked_array(data=[(datetime.date(2017, 12, 4), 0, --),
               (datetime.date(2017, 12, 4), 1, nan),
               (datetime.date(2017, 12, 4), 2, 1.0),
               (datetime.date(2017, 12, 4), 3, 1.0),
               (datetime.date(2017, 12, 4), 4, 1.0)],
         mask=[(False, False,  True), (False, False, False),
               (False, False, False), (False, False, False),
               (False, False, False)],
   fill_value=('NaT', 999999, 1.e+20),
        dtype=[('date', '<M8[D]'), ('id', '<i8'), ('value', '<f8')])

I thought the whole point of masked_array is that it can handle missing data AND bad data. But here, it's only handling missing data.

result['value'].mask

returns

array([ True, False, False, False, False])

The "bad" data actually still got into the array, as nan. I was hoping the mask would give me True True False False False.

In order to realize there's a bad value in the second row, I need to do additional work, like checking for nan.

another_mask = np.isnan(result['value'])
good_result = result['value'][~another_mask]

Finally, this returns

masked_array(data=[1.0, 1.0, 1.0],
         mask=[False, False, False],
   fill_value=1e+20)
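
For what it's worth, the two checks can be collapsed into one step with `np.ma.masked_invalid`, which masks nan/inf while preserving any existing mask. A sketch against a hand-built masked array shaped like the result above:

```python
import numpy as np

# stand-in for result['value'] from the genfromtxt call above:
# element 0 was masked as missing, element 1 came through as nan
values = np.ma.masked_array([np.nan, np.nan, 1.0, 1.0, 1.0],
                            mask=[True, False, False, False, False])

# masked_invalid masks nan/inf and keeps the existing mask,
# so missing and bad entries end up masked together
combined = np.ma.masked_invalid(values)
print(combined.mask)          # [ True  True False False False]
print(combined.compressed())  # [1. 1. 1.]
```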

That works, but I feel like I'm doing something wrong. The whole point of MaskedArray is to handle missing AND bad data, but I'm somehow only using it to find missing data, and I need my own check to find bad data. It feels ugly and un-pythonic.

Is there a way to find both at the same time?

user3240688
  • I don't like that 'not-pythonic' notion. If it works, it is pythonic. Now whether it's making full use of a complex function like `genfromtxt` is another matter. I can't reference it right now, but not too long ago I wrestled with this in another SO question. The function seems to be good at filling missing values. It's less clear whether the masked part is fully functional, or even what it's supposed to do. It isn't well documented – hpaulj Dec 15 '20 at 00:22
  • How about filling with `nan`, and then use `np.ma.masked_invalid` to mask all `nan`. You don't get any brownie points for doing everything in `genfromtxt`. – hpaulj Dec 15 '20 at 00:24
  • https://stackoverflow.com/questions/64971218/numpy-genfromtxt-not-applying-missing-values (may be duplicate) – hpaulj Dec 15 '20 at 00:57
  • When you create a `MaskedArray` you have to specify what is to be masked. That class does not have a default masking criterion. `genfromtxt` does not document its use of that flag. – hpaulj Dec 15 '20 at 01:35
  • Filling with `nan` is my last resort. The reason I don't like `nan` is that `nan` may be an actual value in the data, whereas a string and an empty field are both examples of bad values. I'd like to distinguish between `nan` and bad values. That's why I thought masked_array would be ideal. – user3240688 Dec 15 '20 at 15:48

1 Answer


Playing around with a simple input:

In [143]: txt='''1,2
     ...: 3,nan
     ...: 4,foo
     ...: 5,
     ...: '''.splitlines()
In [144]: txt
Out[144]: ['1,2', '3,nan', '4,foo', '5,']

By specifying a specific string as 'missing' (it may be a list?), I can 'mask' it, along with blanks:

In [146]: np.genfromtxt(txt,delimiter=',', missing_values='foo', 
       usemask=True, usecols=1)
Out[146]: 
masked_array(data=[2.0, nan, --, --],
             mask=[False, False,  True,  True],
       fill_value=1e+20)

It looks like it converted all values to float, but generated the mask based on the strings (or lack thereof):

In [147]: _.data
Out[147]: array([ 2., nan, nan, nan])

I can replace both types of 'missing' with a specific value. Since it's doing a float conversion, the fill value can be given as 100 or '100':

In [151]: np.genfromtxt(txt,delimiter=',', missing_values='foo', 
    usecols=1, filling_values=100)
Out[151]: array([  2.,  nan, 100., 100.])

In a more complex case I can imagine writing a converter for the column. I've only dabbled in that feature.
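
For example, a minimal converter sketch along those lines (the name `safe_float` is made up; the idea is to map anything unparseable to nan so it can be masked afterwards with `np.ma.masked_invalid`):

```python
import numpy as np

def safe_float(field):
    # genfromtxt hands the converter the raw field (str or bytes);
    # anything that doesn't parse as a float becomes nan
    try:
        return float(field)
    except ValueError:
        return np.nan

txt = ['1,2', '3,nan', '4,foo', '5,']
arr = np.genfromtxt(txt, delimiter=',', usecols=1,
                    converters={1: safe_float})
result = np.ma.masked_invalid(arr)
print(result.mask)  # [False  True  True  True]
```

Note that with this approach a literal 'nan' in the file and a bad string like 'foo' end up indistinguishable, which is exactly the limitation raised in the comments above.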

The documentation for these parameters is slim, so figuring out which combinations work, and in what order, is a matter of trial and error (or a lot of code digging).

More details in the follow up question: numpy genfromtxt - how to detect bad int input values

hpaulj
  • Thanks. What is your advice for non-floats? Say someone puts in a string where the correct format is an int? Right now, the entire read would crash. I was hoping to get back a MaskedArray with the bad entry having a mask of True. – user3240688 Dec 15 '20 at 21:49