2

I have

import pandas as pd
import numpy as np
df = pd.DataFrame({"x": ["red", "blue", np.nan, np.nan, np.nan, np.nan, np.nan, ],
                   "y": [np.nan, np.nan, np.nan, 'cold', 'warm', np.nan, np.nan, ],
                   "z": [np.nan, np.nan,  np.nan, np.nan, np.nan, 'charm', 'strange'],
                  }).astype("category")

giving

     x     y        z
0   red   NaN      NaN
1  blue   NaN      NaN
2   NaN   NaN      NaN
3   NaN  cold      NaN
4   NaN  warm      NaN
5   NaN   NaN    charm
6   NaN   NaN  strange

I would like to add a new categorical column with unordered values red,blue,hot,cold,warm, charm, strange, filled in appropriately. I have many such columns, not just three.

Some possiblities:

  • astype(str) and concatenating and then re-creating a categorical
  • creating a new categorical type using union_categoricals and then cast each column to that type? and then serially fillna() them?

I can't make those or anything else work.

Notes: using .astype(pd.CategoricalDtype(ordered=True)) in place of .astype("category") in defining df also works with the answer below.

CPBL
  • 3,783
  • 4
  • 34
  • 44

2 Answers2

1

New Solution

For the purpose of using for a large datasets, the following solution may be more efficient:

def my_fun(x):
    m = ~ pd.isnull(x)
    if m.any():
        return x[m]
    else:
        return np.nan

df['new'] = np.apply_along_axis(my_fun, 1, df.to_numpy())

      x     y        z      new
0   red   NaN      NaN      red
1  blue   NaN      NaN     blue
2   NaN   NaN      NaN      NaN
3   NaN  cold      NaN     cold
4   NaN  warm      NaN     warm
5   NaN   NaN    charm    charm
6   NaN   NaN  strange  strange

Edited answer

As specified by the OP, in case there are rows where all values are np.NaN we could try the following solution:

df['new_col'] = df.dropna(how='all').apply(lambda x: x.loc[x.first_valid_index()], axis=1)
df['new_col'] = pd.Categorical(df.new_col)
df

      x     y        z  new_col
0   red   NaN      NaN      red
1  blue   NaN      NaN     blue
2   NaN   NaN      NaN      NaN
3   NaN  cold      NaN     cold
4   NaN  warm      NaN     warm
5   NaN   NaN    charm    charm
6   NaN   NaN  strange  strange
Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
  • 1
    Wow! But the new column should be categorical. – CPBL Dec 09 '22 at 22:19
  • Yes I just modified. In case you have rows where all values are np.NaN. – Anoushiravan R Dec 09 '22 at 22:48
  • 1
    I found a problem in my real data: I get a "KeyError: None" error due to there being an all-NaN row. I've now modified the example by changing "hot" to np.nan. Your answer now needs a .dropna(how='all') after `df.` to work (unless you've another idea). Thank you!!! – CPBL Dec 09 '22 at 22:50
  • 1
    Btw, this method is extremely slow for my data (2496520 rows x 156 columns). I do wonder whether there is something faster. – CPBL Dec 10 '22 at 01:10
  • 1
    @CPBL I posted a new solution. Could you please try this one and let me know how it works on your data set? – Anoushiravan R Dec 10 '22 at 17:25
  • 1
    Thank you very much. This made me time it, and actually, it wasn't taking that long for one iteration (my code needs several): your first one-line code took 152 seconds. The new one takes 96 seconds. These are both fast enough, actually, and your first one is twice as elegant. Thanks!! – CPBL Dec 11 '22 at 15:16
  • That's my pleasure. I was also curious about the run time of the new one on a huge data set but glad it helped :) – Anoushiravan R Dec 11 '22 at 16:33
1

Try ffill()

df['col'] = df.ffill(axis=1).iloc[:,-1].astype('category')

or stack() with groupby()

df['col'] = df.stack().groupby(level=0).first().astype('category')

Output:

      x     y        z      col
0   red   NaN      NaN      red
1  blue   NaN      NaN     blue
2   NaN   NaN      NaN      NaN
3   NaN  cold      NaN     cold
4   NaN  warm      NaN     warm
5   NaN   NaN    charm    charm
6   NaN   NaN  strange  strange
rhug123
  • 7,893
  • 1
  • 9
  • 24
  • Thanksl `ffill()` method took 555 seconds, as compared with 96s and 152s for methods in @AnoushiravanR's answer. – CPBL Dec 11 '22 at 15:46