0

I have a df I call low with columns containing both float and strings. I need to remove the * while preserving the exact numbers and then cast everything to float.

         Li        Be        Sc
0   12.1737    0.1045   0.03365
1  12.67865    0.1287   *0.0022
2   11.5056   0.10595   0.03115
3  11.79965    0.0945    0.0313
4   12.1499   0.10585   0.03535
5  12.35725  *0.11115   0.06515
6  *10.8227   *0.9693  *0.05815
7  11.63235   0.10275   0.02485
8  12.68735  *0.28995    0.0513
9  11.65235   0.10385    0.0188
data = {
    'Li': [12.1737, 12.67865, 11.5056, 11.79965, 12.1499, 12.35725, '*10.8227', 11.63235, 12.68735, 11.65235],
    'Be': [0.1045, 0.1287, 0.10595, 0.0945, 0.10585, '*0.11115', '*0.9693', 0.10275, '*0.28995', 0.10385],
    'Sc': [0.03365, '*0.0022', 0.03115, 0.0313, 0.03535, 0.06515, '*0.05815', 0.02485, 0.0513, 0.0188]}
low = pd.DataFrame(data)

I was able to remove the * while keeping the numbers intact by producing a mask that only selects values with * in them, casting those values to str and then using str.replace to replace the * with an empty string. The line is

low.iloc[:,2:-1] = low.iloc[:,2:-1].mask(
    low.iloc[:,2:-1].apply(pd.to_numeric, errors='coerce').isnull(),
    low.iloc[:,2:-1].astype(str).apply(lambda x: x.str.replace("*", "")))

That produces columns which look like this

   Li  Be  Sc
0   12.1737 0.1045  0.03365
1   12.67865    0.1287  0.0022
2   11.5056 0.10595 0.03115
3   11.79965    0.0945  0.0313
4   12.1499 0.10585 0.03535
5   12.35725    0.11115 0.06515
6   10.8227 0.9693  0.05815
7   11.63235    0.10275 0.02485
8   12.68735    0.28995 0.0513
9   11.65235    0.10385 0.0188

and low.info returns

Data columns (total 44 columns):
#   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
0   Date            3 non-null      object
1   Sample Name     3 non-null      object
2   Li              3 non-null      object
3   Be              3 non-null      object
4   Sc              3 non-null      object
5   V               3 non-null      object
..................
dtypes: object(44)
memory usage: 1.2+ KB

Now I need to convert the columns to float. However if I try

low.iloc[:,2:-1] = low.iloc[:,2:-1].astype(float)

That returns

Data columns (total 44 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Date            10 non-null     object
 1   Sample Name     10 non-null     object
 2   Li              10 non-null     object
 3   Be              10 non-null     object
 4   Sc              10 non-null     object
 5   V               10 non-null     object
...................
dtypes: object(44)
memory usage: 3.6+ KB

Similarly

low.iloc[:,2:-1] = low.iloc[:,2:-1].apply(pd.to_numeric, errors='coerce')

low.iloc[:,2:-1] = low.iloc[:,2:-1].apply(lambda x: x.astype(float))

and other variations of above return a dataframe of all type object however

for c in low.columns[2:-1]:
    low[c] = low[c].astype(float)

returns a df with columns of float

Data columns (total 44 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Date            10 non-null     object 
 1   Sample Name     10 non-null     object 
 2   Li              10 non-null     float64
 3   Be              10 non-null     float64
 4   Sc              10 non-null     float64
 5   V               10 non-null     float64
........ 
dtypes: float64(41), object(3)
memory usage: 3.6+ KB

Why does the loop work but versions of astype and apply do not?? How can I convert without looping through columns?

wjandrea
  • 28,235
  • 9
  • 60
  • 81
Gingerhaze
  • 664
  • 2
  • 5
  • 13
  • It'd help to provide a [mre]. On first glance, and not knowing Pandas super well, my first guess is that the slicing is part of the problem, but the data you've provided doesn't require slicing. On the other hand, if it turns out it's *not* part of the problem, then you can simplify the example. For specifics, see [How to make good reproducible pandas examples](/q/20109391/4518341). – wjandrea Jul 20 '22 at 16:03
  • For some reason, when assigning multiple columns, pandas keeps the original `dtypes` for the columns. Notice that, if you assign to new columns (e.g. `df.loc[:, ['col1, 'col2]] = (....)` then `col1` and `col2` will have float dtypes. But, since `Sc` and `Be` already existed in your original df, pandas is keeping their dtypes (which is `object`). On the other hand, for single-column assignment, it _updates_ the dtype for that column. Sounds like a bug to me, and I would definitely open a github issue in pandas github page: https://github.com/pandas-dev/pandas/issues – rafaelc Jul 20 '22 at 16:11
  • After testing for myself, it turns out the slicing is relevant, but `.iloc` is not, and you can reproduce the issue with only one column, like `low[:] = low.astype(float)` – wjandrea Jul 20 '22 at 16:14
  • Also, it looks like you can simplify the cleaning step: `low = low.apply(lambda col: col.str.replace("*", "", regex=False))`. Or, actually, that step isn't really relevant to the problem when you could just provide us the data after that step with `low.to_dict(orient='list')`. – wjandrea Jul 20 '22 at 16:19
  • @wjandrea the problem with just using apply and str replace is that it will set all the values without a * with nan. And if I try to cast everything to a string first before doing str replace, it will add decimal places to some of the numbers. For example it will take 0.05815 and output 0.05815999999999 – Gingerhaze Jul 20 '22 at 16:27
  • @Gingerhaze Oops, I forgot the "containing both float and strings" part. My bad. – wjandrea Jul 20 '22 at 16:31
  • I added a method to create the mixed float/string df to OP's question. That can be used for testing. – BeRT2me Jul 20 '22 at 16:38
  • Does this answer your question? [I need to change the type of few columns in a pandas dataframe. Can't do so using iloc](/q/63988597/4518341). It looks like the fundamental difference is between setting *columns* using `low[]` or `low[]` vs setting *a range of values* using `low.iloc` (or for that matter, `low.loc` or `low[:]`). – wjandrea Jul 20 '22 at 17:42

1 Answers1

-1

Remove the * and convert to float.

cols = ['Li', 'Be', 'Sc'] # Or how ever else you want to id them.

df[cols] = df[cols].applymap(lambda x: float(x.replace('*', '')) if isinstance(x, str) else x).astype(float)

print(df)
print(df.dtypes)

Output:

         Li       Be       Sc
0  12.17370  0.10450  0.03365
1  12.67865  0.12870  0.00220
2  11.50560  0.10595  0.03115
3  11.79965  0.09450  0.03130
4  12.14990  0.10585  0.03535
5  12.35725  0.11115  0.06515
6  10.82270  0.96930  0.05815
7  11.63235  0.10275  0.02485
8  12.68735  0.28995  0.05130
9  11.65235  0.10385  0.01880

Li    float64
Be    float64
Sc    float64
dtype: object

On very large DataFrames, this may see better performance:

df[cols] = (df[cols].apply(lambda x: x.str.replace('*', '', regex=False).astype(float))
             .combine_first(df[cols].apply(lambda x: pd.to_numeric(x, errors='coerce'))))
BeRT2me
  • 12,699
  • 2
  • 13
  • 31
  • You changed the indexing, so it's not a direct equivalent to OP's. If you set `cols = slice(None)`, it doesn't work; the dtype is still `object`. – wjandrea Jul 20 '22 at 16:33
  • 1
    I tried this and it sets all the values not containing * to nan. I'm guessing because when it tries to do str replace it doesn't recognize those values as strings – Gingerhaze Jul 20 '22 at 16:36
  • Okay, now it will work @Gingerhaze – BeRT2me Jul 20 '22 at 16:40
  • @BeRT2me okay now this works thanks! makes more sense then my attempt...I am still not sure why the cols would not cast to float after the * were removed though – Gingerhaze Jul 20 '22 at 16:51
  • @BeRT2me It's not "None columns", it's a full slice, the same as `low[:]`. You can try it for yourself and see that the data changes correctly, but the column dtypes stay the same. – wjandrea Jul 20 '22 at 17:10
  • @BeRT2me Or, ignoring the slice, if you do `low.loc[:, cols] = low.loc[:, cols].applymap(...`, the same thing happens: the data changes, but the column dtypes don't. – wjandrea Jul 20 '22 at 17:14
  • 1
    `loc` assigns values, whereas without `loc` assigns Series, so that makes sense. – BeRT2me Jul 20 '22 at 17:36