1

So in the screenshot below, we have 3 different energy sites, ID01, ID18, and ID31. They're in a dummy variable type of format, and for visualization purposes I want to just create a column named 'Sites' that I can use. You'll see the loop I quickly made to do this, but it seems super inefficient. Any pointers on how to achieve this in the fastest way possible?

Screenshot

Julien Marrec
  • 11,605
  • 4
  • 46
  • 63
Justin
  • 105
  • 1
  • 8
  • 3
    If you want to maximize the chances of getting an answer, it's always best to share text-based datasets... We can't play with the data in your image without having to recreate it, which is too much of a hassle. – Julien Marrec May 14 '18 at 14:06
  • idxmax worked, thank you. – Justin May 14 '18 at 14:17

1 Answers1

5

Setup

data = pd.DataFrame([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0]
], columns=['ID01', 'ID18', 'ID31']).assign(A=1, B=2)

data

   ID01  ID18  ID31  A  B
0     1     0     0  1  2
1     0     1     0  1  2
2     0     0     1  1  2
3     1     0     0  1  2
4     0     1     0  1  2

dot product with strings and objects.

This works if these are truly dummy values 0 or 1

def undummy(d):
    return d.dot(d.columns)

data.assign(Site=data.filter(regex='^ID').pipe(undummy))

   ID01  ID18  ID31  A  B  Site
0     1     0     0  1  2  ID01
1     0     1     0  1  2  ID18
2     0     0     1  1  2  ID31
3     1     0     0  1  2  ID01
4     0     1     0  1  2  ID18

argmax slicing

This works but can produce unexpected results if data is not as represented in question.

def undummy(d):
    return d.columns[d.values.argmax(1)]

data.assign(Site=data.filter(regex='^ID').pipe(undummy))

   ID01  ID18  ID31  A  B  Site
0     1     0     0  1  2  ID01
1     0     1     0  1  2  ID18
2     0     0     1  1  2  ID31
3     1     0     0  1  2  ID01
4     0     1     0  1  2  ID18
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Thank you for taking the time to respond so thoroughly! – Justin May 14 '18 at 14:29
  • You are welcome! Glad to help. – piRSquared May 14 '18 at 14:29
  • 1
    @JustinZint make sure to upvote the answer if it was helpful, and if it solved your problem please also mark it as accepted so we can flag the thread as resolved. – Julien Marrec May 14 '18 at 14:42
  • 1
    on a large dataframe, it is a bit faster to use [simple indexing with `np.where`](https://stackoverflow.com/a/51275990/6671176) than `dot` (I timed both methods with a 100000 row dataframe, `dot` took about 0.016 seconds on average, and `np.where` took about 0.00681 seconds on average) – sacuL Jul 11 '18 at 02:06