0

I'm trying to use genfromtxt to extract a csv file that contains missing values such as 'na' and '-' I am required to look for the minimum value in the list of data, but the missing values got returned as -1.

this is my code:

data = np.genfromtxt('median-resale-prices-for-registered-applications-by-town-and-flat-type.csv',
                     skip_header=1,
                     dtype=[('quarter', 'U7'),  ('town', 'U50'), ('flat_type', 'U10'), ('price', 'i8')], delimiter=",",
                     missing_values=['na','-'], filling_values=[0])

min_price = np.min(data['price'])
print(min_price)

and this is what i have in return

-1

i have also tried isnan()

print("Original data: " + str(data.shape))
null_rows = np.isnan(data['price'])
print(null_rows)
nonnull_values = data[null_rows==False]
print("Filtered data: " + str(nonnull_values.shape))

however, python did not perceive the na and - values as isnan Original data: (9360,) [False False False ... False False False] Filtered data: (9360,)

is there something wrong with my code?

Stefan Wuebbe
  • 2,109
  • 5
  • 17
  • 28
Yue Min
  • 11
  • 3
  • without the missing/filled parameters what values do you get in the `price` column? Without a sample of the file it is hard to offer much of a diagnosis. I'd have create a sample based on what I imagine yours is like. – hpaulj Jan 01 '23 at 21:31
  • How do you know the filled values are -1? Did you actually compare `data` with the csv? Or are you just assuming that's what's happening from this min? – hpaulj Jan 01 '23 at 21:40
  • quarter,town,flat_type,price 2007-Q2,ANG MO KIO,1-ROOM,na 2007-Q2,ANG MO KIO,2-ROOM,- 2007-Q2,ANG MO KIO,3-ROOM,172000 2007-Q2,ANG MO KIO,4-ROOM,260000 2007-Q2,ANG MO KIO,5-ROOM,372000 – Yue Min Jan 02 '23 at 02:14
  • im assumed that what's happening since the returned value is +1, i tried changing to the price dtype to f8 but now both the max and min got returned as NaN – Yue Min Jan 02 '23 at 02:18
  • The sample would be easier to use if added to the question. In the comment newlines are lost. Show your `data` with that sample. `np.nan` is a float, so can't appear in a `i8`. Don't assume anything. I haven't used fill and missing_values much, so really need to pull up the docs. And experiement. – hpaulj Jan 02 '23 at 02:37

1 Answers1

1

With the sample, adapted from the comment:

In [26]: txt1="""quarter,town,flat_type,price
    ...: 2007-Q2,ANG MO KIO,1-ROOM,na 
    ...: 2007-Q2,ANG MO KIO,2-ROOM,- 
    ...: 2007-Q2,ANG MO KIO,3-ROOM,172000 
    ...: 2007-Q2,ANG MO KIO,4-ROOM,260000 
    ...: 2007-Q2,ANG MO KIO,5-ROOM,372000"""

Accepting that last field as floats (no fill stuff):

In [27]: data = np.genfromtxt(txt1.splitlines(),
    ...:                      skip_header=1,
    ...:                      dtype=[('quarter', 'U7'),  ('town', 'U50'), ('flat_type', 'U10'), ('price', 'f8')], delimiter=",")

In [28]: data
Out[28]: 
array([('2007-Q2', 'ANG MO KIO', '1-ROOM',     nan),
       ('2007-Q2', 'ANG MO KIO', '2-ROOM',     nan),
       ('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000.),
       ('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000.),
       ('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000.)],
      dtype=[('quarter', '<U7'), ('town', '<U50'), ('flat_type', '<U10'), ('price', '<f8')])

genfromtxt normally uses nan for values it can't parse as floats.

For integer, it apparently uses -1 instead:

In [29]: data = np.genfromtxt(txt1.splitlines(),
    ...:                      skip_header=1,
    ...:                      dtype=[('quarter', 'U7'),  ('town', 'U50'), ('flat_type', 'U10'), ('price', 'i8')], delimiter=",")

In [30]: data
Out[30]: 
array([('2007-Q2', 'ANG MO KIO', '1-ROOM',     -1),
       ('2007-Q2', 'ANG MO KIO', '2-ROOM',     -1),
       ('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000),
       ('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000),
       ('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000)],
      dtype=[('quarter', '<U7'), ('town', '<U50'), ('flat_type', '<U10'), ('price', '<i8')])

After some fiddling, I got this to work. The key was to use a single value of filling_values, not a list.

In [54]: data = np.genfromtxt(txt1.splitlines(),
    ...:                      skip_header=1,
    ...:                      dtype=[('quarter', 'U7'), ('town', 'U50'), ('flat_type', 'U10'), ('price', 'i8')], 
    ...:                      delimiter=",", 
    ...:                      missing_values=['na','-'], 
    ...:                      filling_values=-999)

In [55]: data
Out[55]: 
array([('2007-Q2', 'ANG MO KIO', '1-ROOM',   -999),
       ('2007-Q2', 'ANG MO KIO', '2-ROOM',   -999),
       ('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000),
       ('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000),
       ('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000)],
      dtype=[('quarter', '<U7'), ('town', '<U50'), ('flat_type', '<U10'), ('price', '<i8')])

Looking at the code (via [source] in the docs), I see we can use dict, specifying different values for different columns. Thus

missing_values={3:['na','-']}, 
filling_values={3:-999})

There are more details in the code than in the documentation. I haven't used these values much, so each time I have learn more.

edit

In today's question you seem to have forgotten all that you learned here. data is a structured array. With f8 you get nan for the missing values, not -1. And you attempt to treat the array as a list of tuples. Why not continue to treat it as a structured array?

The pricefield:

In [104]: data['price']
Out[104]: array([    nan,     nan, 172000., 260000., 372000.])

In [106]: mask = np.isnan(data['price'])   # or test against -1 for ints
In [107]: mask
Out[107]: array([ True,  True, False, False, False])

Use the mask to select, or "delete" elements from the 1d array:

In [108]: data[mask]
Out[108]: 
array([('2007-Q2', 'ANG MO KIO', '1-ROOM', nan),
       ('2007-Q2', 'ANG MO KIO', '2-ROOM', nan)],
      dtype=[('quarter', '<U7'), ('town', '<U50'), ('flat_type', '<U10'), ('price', '<f8')])

In [109]: data[~mask]
Out[109]: 
array([('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000.),
       ('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000.),
       ('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000.)],
      dtype=[('quarter', '<U7'), ('town', '<U50'), ('flat_type', '<U10'), ('price', '<f8')])

But here's a list based approach:

In [110]: alist = data.tolist()
In [111]: alist
Out[111]: 
[('2007-Q2', 'ANG MO KIO', '1-ROOM', nan),
 ('2007-Q2', 'ANG MO KIO', '2-ROOM', nan),
 ('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000.0),
 ('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000.0),
 ('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000.0)]

In [112]: [i for i in alist if not np.isnan(i[3])]
Out[112]: 
[('2007-Q2', 'ANG MO KIO', '3-ROOM', 172000.0),
 ('2007-Q2', 'ANG MO KIO', '4-ROOM', 260000.0),
 ('2007-Q2', 'ANG MO KIO', '5-ROOM', 372000.0)]
hpaulj
  • 221,503
  • 14
  • 230
  • 353
  • thank you so much for helping, i changed the price dtype to f8 and used np.nanmax and np.nanmin to find the maximum and minimum values, and it worked fine – Yue Min Jan 02 '23 at 16:37