
I have a set of values coming from an experiment, and I want to drop some entries from all arrays when the corresponding entry in one of the other arrays is missing. Meaning: I measure a field, a polarization and the error of the polarization. The machine doing this measurement sometimes does not write values in some of those lines, so I might get something like this (where field = data[0], polarization = data[1] and error = data[2]):

field = [1, 2, 3, 3, 2, 1, nan, 4, 1, 2]
polarization = [nan, 10, 230, 13, 123, 50, 102, 90, 45, 1337]
error = [0.1, 0.1, 0.2, 0.1, 0.1, 0.3, 0.1, 0.1, 0.4, 0.2]

Now I want to delete the first element of field, polarization and error, because polarization[0] is nan, and also element [6] of all three arrays, because field[6] is nan.

This is how I get my data:

import numpy as np

class DataFile(object):
    def __init__(self, filename):
        self._filename = filename

    def read_dat_file(self):
        data = np.genfromtxt(self._filename, delimiter=',',
                             usecols=(3, 4, 5), skip_header=23, skip_footer=3,
                             unpack=True,
                             converters={3: lambda x: self._conv(x),
                                         4: lambda x: self._conv(x),
                                         5: lambda x: self._2_conv(x)})
        return data

a = DataFile("DATFILE.DAT")
data = a.read_dat_file()
print data

The _conv functions just do some unit conversion, or write 'nan' if the value is " ".
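
Simplified, such a converter does roughly this (blank string becomes nan, otherwise the value is converted and rescaled):

def _conv(self, value):
    # roughly: blank field becomes nan, otherwise convert to float
    if not value.strip():
        return np.nan
    return float(value)  # a unit conversion factor would be applied here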

I tried to do something like this:

data = data[~np.isnan(data).any(axis=1)]

But then I got back a single array and things got messy. My next approach was to count elements and delete the same elements from all arrays, and so on. It works, but it's ugly. So what's the best solution here?
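
To be explicit about the shapes involved: genfromtxt with unpack=True should give an array of shape (3, N), so I suspect the mask has to be built along axis=0 and applied to the columns, roughly like this (untested):

mask = ~np.isnan(data).any(axis=0)        # columns (measurements) without any nan
field, polarization, error = data[:, mask]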

xtlc

3 Answers


Try using the masked_where function from numpy.ma.

A (very basic) example:

import numpy as np

y = np.array([2, 1, 5, 2])                       # y axis
x = np.array([1, 2, 3, 4])                       # x axis
m = np.ma.masked_where(y > 5, y)                 # mask values larger than 5
new_x = np.ma.masked_where(np.ma.getmask(m), x)  # apply the mask of m to x

The nice thing is that you can now apply this mask to many more arrays without going through the masking process for each of them, and it will not be as ugly as counting elements.

In your case you will probably need to go through every array, check for nan and then apply that mask on all the other arrays. Hope that helps.
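
A rough sketch of that idea for the three arrays from the question (assuming they are already numpy float arrays) could look like this:

import numpy as np

field = np.array([1, 2, 3, 3, 2, 1, np.nan, 4, 1, 2])
polarization = np.array([np.nan, 10, 230, 13, 123, 50, 102, 90, 45, 1337])
error = np.array([0.1, 0.1, 0.2, 0.1, 0.1, 0.3, 0.1, 0.1, 0.4, 0.2])

# combine the nan positions of all arrays into a single mask
bad = np.isnan(field) | np.isnan(polarization) | np.isnan(error)

field_m = np.ma.masked_where(bad, field)
polarization_m = np.ma.masked_where(bad, polarization)
error_m = np.ma.masked_where(bad, error)

print(field_m.compressed())  # .compressed() drops the masked entries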

red_tiger

You can iterate over the rows and create a boolean mask, then use boolean indexing to keep only the rows that pass:

import numpy as np

field = [1,2,3,3,2,1,-1,4,1,2]
polarization = [-1, 10,230,13,123,50,102,90,45,1337]
error = [0.1, 0.1, 0.2, 0.1, 0.1, 0.3, 0.1, 0.1, 0.4, 0.2]

#transposition is needed to get expected row-col format
array = np.array([field, polarization, error]).T
print(array)

#create your filter function
filter = lambda row : row[0] > 0 and row[1] > 0 and row[2] > 0

#create boolean mask by applying filter
mask = np.apply_along_axis(filter, 1, array)
print(mask)

new_array = array[mask]
print(new_array)
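
For the actual nan case from the question (instead of the -1 placeholders used above), the same row mask can also be built without apply_along_axis, for example (a sketch):

mask = ~np.isnan(array).any(axis=1)  # keep rows that contain no nan at all
new_array = array[mask]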
luk32

I combined another thread and red_tiger's answer, and I want to share it with you: just run this function over an array holding your data:

import numpy as np

data = np.array([field, polarization, error]).T

def delete_NaN_rows(data):
    filter = lambda row: ~np.isnan(row[0]) and ~np.isnan(row[1]) and ~np.isnan(row[2])
    mask = np.apply_along_axis(filter, 1, data)
    clean_data = data[mask]
    return clean_data.T

I used the inverse (~) of np.isnan(element) to identify the rows with a NaN entry and to delete them.
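
A short usage sketch with the example arrays from the question (the nan entries sit in rows 0 and 6, which are the ones being dropped):

import numpy as np

field = [1, 2, 3, 3, 2, 1, np.nan, 4, 1, 2]
polarization = [np.nan, 10, 230, 13, 123, 50, 102, 90, 45, 1337]
error = [0.1, 0.1, 0.2, 0.1, 0.1, 0.3, 0.1, 0.1, 0.4, 0.2]

data = np.array([field, polarization, error]).T
field, polarization, error = delete_NaN_rows(data)
print(field)  # the entries at positions 0 and 6 are gone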

xtlc