
I am wondering how I can replace specific values when loading data from a given (CSV) file with multiple columns that combine string and numerical values.

In the example that follows, suppose that you have a number of geographical positions with known latitudes and longitudes, a specific set of properties (P1-P5), and a class (included just to bring the string component into the problem). There are some missing values, which genfromtxt properly replaces (the missing value in this case is -999), and there are, additionally, values that are not correct (fake, or other kinds of flags), such as 0.0. How can we replace 0.0 with -999?

Data:

Name,lat,long,P1,P2,P3,P4,P5,Class
id1,71.234,10.123,0.0,11,212,222,1920,A
id2,72.234,11.111,,,312,342,1920,A
id3,77.832,12.111,1,0.0,,333,4520,B
id4,77.987,12.345,3,0.0,,231,2020,B
id5,77.111,13.099,5,11,212,222,1920,A

And the code so far:

dfile = "data.csv"
missing_value = -999

import numpy as np

data = np.genfromtxt(dfile, unpack=True, comments='#', names=True,
                     autostrip=True, filling_values=missing_value,
                     dtype=('S5', 'float', 'float', 'float', 'float',
                            'float', 'float', 'float', 'S1'),
                     delimiter=',')
new_data = np.where(data != 0.0, data, -999)

I have used np.where as in np.where(data != 0.0, data, -999), but I got an error:

TypeError: invalid type promotion

I do not know what I am missing...

PS 1: Perhaps it is solvable with pandas, but I am looking for a solution that does not depend on it.

PS 2: I know that a dirty workaround would be to set the incorrect values (the 0.0s) to my missing flag in the initial file, but what if there are multiple values that we would like to exclude? (Or if we are combining data with different flags?)

gmaravel
  • The problem seems to be that the data type of your data is a numpy array of numpy voids rather than a numpy array. – Nathan Jul 01 '19 at 13:21
  • Without the `unpack` this array would be a structured array. The 'columns' are actually `fields`, addressed by name. Value replacement has to be done field by field. – hpaulj Jul 01 '19 at 14:47
  • OK, the `unpack` does nothing. Practice looking at your array, e.g. `data['P1']`, `data['Class']`. Look at those fields, and try to modify them. – hpaulj Jul 01 '19 at 15:50
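
For reference, a minimal sketch of the inspection the last comment suggests, assuming the data.csv from the question (dtype=None lets genfromtxt infer one type per column instead of the hand-written tuple):

import numpy as np

missing_value = -999

# One named field per column; blank entries are filled with the missing flag.
data = np.genfromtxt("data.csv", delimiter=',', names=True, dtype=None,
                     encoding=None, filling_values=missing_value)

print(data.dtype.names)    # ('Name', 'lat', 'long', 'P1', ..., 'Class')
print(data['P1'])          # a field is one column of the structured array
print(data['Class'])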

2 Answers

1

Define a simple text:

In [55]: txt= '''foo,bar,test 
    ...: a,1,2 
    ...: b,3,4 
    ...: ''' 

load with genfromtxt:

In [60]: data = np.genfromtxt(txt.splitlines(), encoding=None, names=True, dtype=None, delimiter=',')           
In [61]: data                                                                                                   
Out[61]: 
array([('a', 1, 2), ('b', 3, 4)],
      dtype=[('foo', '<U1'), ('bar', '<i8'), ('test', '<i8')])

Note the dtype - fields with different dtypes and names.

Access fields by name:

In [64]: data['foo']                                                                                            
Out[64]: array(['a', 'b'], dtype='<U1')

Modify one field by index:

In [65]: data['bar']                                                                                            
Out[65]: array([1, 3])
In [66]: data['bar'][0] = 23                                                                                    

Modify another with boolean test (or where):

In [67]: test = data['test']                                                                                    
In [68]: test                                                                                                   
Out[68]: array([2, 4])
In [69]: test==2                                                                                                
Out[69]: array([ True, False])
In [70]: test[test==2]=0                                                                                        
In [71]: test                                                                                                   
Out[71]: array([0, 4])
In [72]: data                                                                                                   
Out[72]: 
array([('a', 23, 0), ('b',  3, 4)],
      dtype=[('foo', '<U1'), ('bar', '<i8'), ('test', '<i8')])
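
Applied to the question's file, the same per-field masking could look roughly like the following sketch (assuming the data.csv layout and the -999 flag from the question; dtype=None lets genfromtxt infer the per-column types):

import numpy as np

missing_value = -999

# Structured array: one named field per column, types inferred by genfromtxt.
data = np.genfromtxt("data.csv", delimiter=',', names=True, dtype=None,
                     encoding=None, filling_values=missing_value)

# Field indexing returns a view, so masked assignment updates data in place.
for name in ('P1', 'P2', 'P3', 'P4', 'P5'):
    field = data[name]
    field[field == 0.0] = missing_value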

Replacement might be easier if you grouped the numeric fields into one (but that requires more understanding of structured array dtypes):

In [80]: data = np.genfromtxt(txt.splitlines(), encoding=None, skip_header=1, dtype=[('id','U3'),('foo',int,2)],
    ...:  delimiter=',')                                                                                        
In [81]: data                                                                                                   
Out[81]: 
array([('a', [1, 2]), ('b', [3, 4])],
      dtype=[('id', '<U3'), ('foo', '<i8', (2,))])
In [82]: data['foo']                                                                                            
Out[82]: 
array([[1, 2],
       [3, 4]])
hpaulj
  • Grouping can be a solution to replace the values, but then we lose the field information (inside the script I am calling different combinations of P1, P2, ..., so I would really like to be able to call them like: data['P1'] + data['P4']) – gmaravel Jul 02 '19 at 08:40
  • By examining each field separately and changing the incorrect values individually, we keep the field names. So it seems to be the best option here, although I do not find it very elegant... My implementation would be: `for p in ['P1','P2','P3','P4','P5']: for i in range(len(data[p])): if data[p][i]==0.0: data[p][i] = missing_value` – gmaravel Jul 02 '19 at 08:59
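
A slightly more compact variant of that loop replaces the inner element loop with a boolean mask; using np.isin, several flag values can be excluded at once, as asked in the question's second postscript (a sketch, assuming data is loaded with genfromtxt using names=True and dtype=None as above):

import numpy as np

missing_value = -999
bad_values = [0.0]            # several flags could be listed, e.g. [0.0, -1.0]

data = np.genfromtxt("data.csv", delimiter=',', names=True, dtype=None,
                     encoding=None, filling_values=missing_value)

for p in ('P1', 'P2', 'P3', 'P4', 'P5'):
    field = data[p]                               # view into the structured array
    field[np.isin(field, bad_values)] = missing_value

combined = data['P1'] + data['P4']    # field names remain available for combinations
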
0

It seems to me the problem is with the np.genfromtxt part. It creates a numpy array of the form:

np.array([np.void, np.void ... ])

This causes np.where to fail. One way of working around this is:

data = np.array([[i for i in j] for j in data])

I don't think this is a very nice solution. But it should work until someone comes along with a real answer.

Nathan
  • `np.void` is the type of a structured array element. The array dtype shows its internal structure. – hpaulj Jul 02 '19 at 03:36
  • Although this works nicely with np.where afterwards, it doesn't keep the names of the fields - which are needed later in the script. – gmaravel Jul 02 '19 at 08:32