
So I have an Excel file that looks like this:

    Name                R    s    l2   max_amplitude   ref_amplitude
    R_0.3_s_0.5_l2_0.1  0.3  0.5  0.1  1.45131445      1.45131445
    R_0.3_s_0.5_l2_0.6  0.3  0.5  0.6  3.52145743      3.52145743
    ...
    R_1.1_s_2.0_l2_1.6  1.1  2.0  1.6  5.07415199      5.07415199
    R_1.1_s_2.0_l2_2.1  1.1  2.0  2.1  5.78820419      5.78820419
    R_1.1_s_2.0_l2_2.6  1.1  2.0  2.6  5.84488964      5.84488964
    R_1.1_s_2.0_l2_3.1  1.1  2.0  3.1  6.35387516      6.35387516

Using the pandas module, I import the data into a DataFrame:

import pandas as pd
df = pd.read_excel("output_var.xlsx", header=0)

Everything seems to be ok:

df

at the command line produces:

       R    s   l2  max_amplitude  ref_amplitude
0    0.3  0.5  0.1       1.451314       1.451314
1    0.3  0.5  0.6       3.521457       3.521457
2    0.3  0.5  1.1       4.770226       4.770226
...
207  1.1  2.0  2.1       5.788204       5.788204
208  1.1  2.0  2.6       5.844890       5.844890
209  1.1  2.0  3.1       6.353875       6.353875

[210 rows x 5 columns]

Now I need to do some calculations based on the value of R, so I need to slice the DataFrame. Column R contains 5 different values: 0.3, 0.5, 0.7, 0.9 and 1.1. Each of these 5 values has 42 rows (5 × 42 = 210). To remove the duplicates from "R" I try

set(df.R)

which returns:

{0.29999999999999999,
 0.5,
 0.69999999999999996,
 0.89999999999999991,
 0.90000000000000002,
 1.1000000000000001}

Apart from representing 0.3 as 0.29999... etc., there are 6 (instead of 5) different values for R. It seems that 0.9 is sometimes stored as 0.89999999999999991 and sometimes as 0.90000000000000002. This can be (partially) solved with:

set(round(df.R,1))

which (at least) returns 5 values:

{0.29999999999999999,
 0.5,
 0.69999999999999996,
 0.90000000000000002,
 1.1000000000000001}
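
The long decimals themselves are harmless: they are just Python's repr of the stored binary doubles, and for instance 0.3 and 0.29999999999999999 parse to the very same value. A quick illustration:

0.3 == 0.29999999999999999   # True: both literals parse to the same double
0.9 == 0.90000000000000002   # True: this is the double nearest to 0.9
0.9 == 0.89999999999999991   # False: this is the neighbouring, distinct double

The real trouble is that the data contains both of those neighbouring doubles for 0.9.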

But now I come to the dangerous part. If I want to do the slicing according to the known values of R (0.3, 0.5, 0.7, 0.9 and 1.1),

len(df[df.R==0.3])

returns

42

and

len(df[df.R==0.9])

returns

41

One row goes missing! (Remember, there are 42 rows for each of the 5 R values, giving the total of 210 rows in the file.) How can I deal with this problem?


1 Answer


Don't check floats for equality. There are some issues with floating point arithmetic (check here for example).
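
A minimal illustration of the underlying issue (plain Python, nothing pandas-specific):

0.1 + 0.2 == 0.3   # False
0.1 + 0.2          # 0.30000000000000004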

Instead, check for closeness (really, really close):

import numpy as np
len(df[np.isclose(df.R, 0.9)])
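
Applied to the slicing in the question, a sketch that loops over the five known R values:

import numpy as np

for r in [0.3, 0.5, 0.7, 0.9, 1.1]:
    subset = df[np.isclose(df.R, r)]   # tolerant match instead of ==
    print(r, len(subset))              # each should now report 42 rows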

Normally, if you don't convert the Series to a set, pandas would handle that for you. So if you want to drop duplicates, I'd suggest using pandas' own methods:

df.drop_duplicates('R')
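
If the float noise is already baked into the data itself, one possible combination (a sketch, assuming rounding to one decimal place is safe for these values) is to round before dropping:

df.assign(R=df.R.round(1)).drop_duplicates('R')
# or, for just the unique R values:
df.R.round(1).unique()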
  • Thank you for your prompt reply, but pandas' 'drop_duplicates' does not solve the floating point problem: 'df.R.drop_duplicates()' yields '0.3 0.5 0.7 0.9 0.9 1.1', so I still have two "different" 0.9 values – Mato Jul 12 '16 at 10:33
  • @Mato Are you sure they are the same in the Excel file? Can you share the file (or a portion of it) if possible? – ayhan Jul 12 '16 at 10:36
  • In Excel the data for R=0.9 does look identical, but I just tried to quickly import the data into Mathematica and the same problem occurred (the two 0.9 values). So it seems that the problem is in the data (though the Excel file was generated with pandas' 'to_excel' command). I will have to search there for the problem. – Mato Jul 12 '16 at 10:52