Most efficient way to un-dummy variables in Pandas DF

Question

So in the screenshot below, we have 3 different energy sites, ID01, ID18, and ID31. They're in a dummy variable type of format, and for visualization purposes I want to just create a column named 'Sites' that I can use. You'll see the loop I quickly made to do this, but it seems super inefficient. Any pointers on how to achieve this in the fastest way possible?

Screenshot

If you want to maximize the chances of getting an answer, it's always best to share text-based datasets... We can't play with the data in your image without having to recreate it, which is too much of a hassle. — Julien Marrec, May 14 '18 at 14:06

piRSquared · Accepted Answer · 2018-05-14T14:23:53.873

5

Setup

data = pd.DataFrame([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0]
], columns=['ID01', 'ID18', 'ID31']).assign(A=1, B=2)

data

   ID01  ID18  ID31  A  B
0     1     0     0  1  2
1     0     1     0  1  2
2     0     0     1  1  2
3     1     0     0  1  2
4     0     1     0  1  2

`dot` product with strings and objects.

This works if these are truly dummy values 0 or 1

def undummy(d):
    return d.dot(d.columns)

data.assign(Site=data.filter(regex='^ID').pipe(undummy))

   ID01  ID18  ID31  A  B  Site
0     1     0     0  1  2  ID01
1     0     1     0  1  2  ID18
2     0     0     1  1  2  ID31
3     1     0     0  1  2  ID01
4     0     1     0  1  2  ID18

`argmax` slicing

This works but can produce unexpected results if data is not as represented in question.

def undummy(d):
    return d.columns[d.values.argmax(1)]

data.assign(Site=data.filter(regex='^ID').pipe(undummy))

   ID01  ID18  ID31  A  B  Site
0     1     0     0  1  2  ID01
1     0     1     0  1  2  ID18
2     0     0     1  1  2  ID31
3     1     0     0  1  2  ID01
4     0     1     0  1  2  ID18

edited May 14 '18 at 14:23

answered May 14 '18 at 14:16

piRSquared

285,575
57
475
624

Thank you for taking the time to respond so thoroughly! – Justin May 14 '18 at 14:29
You are welcome! Glad to help. – piRSquared May 14 '18 at 14:29
1

@JustinZint make sure to upvote the answer if it was helpful, and if it solved your problem please also mark it as accepted so we can flag the thread as resolved. – Julien Marrec May 14 '18 at 14:42
1

on a large dataframe, it is a bit faster to use [simple indexing with `np.where`](https://stackoverflow.com/a/51275990/6671176) than `dot` (I timed both methods with a 100000 row dataframe, `dot` took about 0.016 seconds on average, and `np.where` took about 0.00681 seconds on average) – sacuL Jul 11 '18 at 02:06

Most efficient way to un-dummy variables in Pandas DF

1 Answers1

Setup

`dot` product with strings and objects.

`argmax` slicing

Linked

Most efficient way to un-dummy variables in Pandas DF

1 Answers1

Setup

dot product with strings and objects.

argmax slicing

Linked

`dot` product with strings and objects.

`argmax` slicing