24

I have a dataframe and a dictionary. I need to add a new column to the dataframe and calculate its values based on the dictionary.

Machine learning, adding new feature based on some table:

score = {(1, 45, 1, 1) : 4, (0, 1, 2, 1) : 5}
df = pd.DataFrame(data = {
    'gender' :      [1,  1,  0, 1,  1,  0,  0,  0,  1,  0],
    'age' :         [13, 45, 1, 45, 15, 16, 16, 16, 15, 15],
    'cholesterol' : [1,  2,  2, 1, 1, 1, 1, 1, 1, 1],
    'smoke' :       [0,  0,  1, 1, 7, 8, 3, 4, 4, 2]},
     dtype = np.int64)

print(df, '\n')
df['score'] = 0
df.score = score[(df.gender, df.age, df.cholesterol, df.smoke)]
print(df)

I expect the following output:

   gender  age  cholesterol  smoke    score
0       1   13            1      0      0 
1       1   45            2      0      0
2       0    1            2      1      5
3       1   45            1      1      4
4       1   15            1      7      0
5       0   16            1      8      0
6       0   16            1      3      0
7       0   16            1      4      0
8       1   15            1      4      0
9       0   15            1      2      0
Community
  • 1
  • 1
Roman Kazmin
  • 931
  • 6
  • 18

7 Answers7

13

Since score is a dictionary (so the keys are unique) we can use MultiIndex alignment

df = df.set_index(['gender', 'age', 'cholesterol', 'smoke'])
df['score'] = pd.Series(score)  # Assign values based on the tuple
df = df.fillna(0, downcast='infer').reset_index()  # Back to columns

   gender  age  cholesterol  smoke  score
0       1   13            1      0      0
1       1   45            2      0      0
2       0    1            2      1      5
3       1   45            1      1      4
4       1   15            1      7      0
5       0   16            1      8      0
6       0   16            1      3      0
7       0   16            1      4      0
8       1   15            1      4      0
9       0   15            1      2      0
ALollz
  • 57,915
  • 7
  • 66
  • 89
  • 1
    Nice one of `MultiIIndex`. Alternative: `df['score'] =df.set_index(['gender', 'age', 'cholesterol', 'smoke']).index.map(score).fillna(0).to_numpy()`. – Quang Hoang Oct 29 '19 at 16:43
  • 4
    @ALollz, forgive me, I love your answers but I have to speak up when I see so many upvotes on an answer like this. This answer is fine **AND** clever. But it is not great. There are too many moving parts for no great gain. In the process, you've created a new `df` via `set_index`, a new `Series` via constructor. Though you get a benefit of index alignment when you assign it to `df['score']`. Lastly, `fillna(0, downcast='infer')` gets the job done but no one should prefer this lengthy solution with the creation of many pandas objects unnecessarily. – piRSquared Oct 29 '19 at 17:00
  • Again, apologies, you have my upvote as well, I just want to guide folks to simpler answers. – piRSquared Oct 29 '19 at 17:00
  • @piRSquared I went for lunch, and was surprised this got the attention it did when I came back. I agree that it's a bit convoluted all to do something that a simple `merge` could accomplish. I figured that answer would get posted quickly so I opted for an alternative and for some reason had MultiIndices on my mind. I agree, this probably shouldn't be the accepted answer, so hopefully that doesn't happen. – ALollz Oct 29 '19 at 17:07
  • 1
    Oh I'm with you. I've answered the same many times. I'm just doing my best to serve the community (-: I trust you get my intention. – piRSquared Oct 29 '19 at 17:08
8

Using assign with a list comprehension, getting a tuple of values (each row) from the score dictionary, defaulting to zero if not found.

>>> df.assign(score=[score.get(tuple(row), 0) for row in df.values])
   gender  age  cholesterol  smoke  score
0       1   13            1      0      0
1       1   45            2      0      0
2       0    1            2      1      5
3       1   45            1      1      4
4       1   15            1      7      0
5       0   16            1      8      0
6       0   16            1      3      0
7       0   16            1      4      0
8       1   15            1      4      0
9       0   15            1      2      0

Timings

Given the variety of approaches, I though it would be interesting to compare some of the timings.

# Initial dataframe 100k rows (10 rows of identical data replicated 10k times).
df = pd.DataFrame(data = {
    'gender' :      [1,  1,  0, 1,  1,  0,  0,  0,  1,  0] * 10000,
    'age' :         [13, 45, 1, 45, 15, 16, 16, 16, 15, 15] * 10000,
    'cholesterol' : [1,  2,  2, 1, 1, 1, 1, 1, 1, 1] * 10000,
    'smoke' :       [0,  0,  1, 1, 7, 8, 3, 4, 4, 2] * 10000},
     dtype = np.int64)

%timeit -n 10 df.assign(score=[score.get(tuple(v), 0) for v in df.values])
# 223 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n 10 
df.assign(score=[score.get(t, 0) for t in zip(*map(df.get, df))])
# 76.8 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n 10
df.assign(score=[score.get(v, 0) for v in df.itertuples(index=False)])
# 113 ms ± 2.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 df.assign(score=df.apply(lambda x: score.get(tuple(x), 0), axis=1))
# 1.84 s ± 77.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n 10
(df
 .set_index(['gender', 'age', 'cholesterol', 'smoke'])
 .assign(score=pd.Series(score))
 .fillna(0, downcast='infer')
 .reset_index()
)
# 138 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n 10
s=pd.Series(score)
s.index.names=['gender','age','cholesterol','smoke']
df.merge(s.to_frame('score').reset_index(),how='left').fillna(0).astype(int)
# 24 ms ± 2.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n 10
df.assign(score=pd.Series(zip(df.gender, df.age, df.cholesterol, df.smoke))
                .map(score)
                .fillna(0)
                .astype(int))
# 191 ms ± 7.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n 10
df.assign(score=df[['gender', 'age', 'cholesterol', 'smoke']]
                .apply(tuple, axis=1)
                .map(score)
                .fillna(0))
# 1.95 s ± 134 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Alexander
  • 105,104
  • 32
  • 201
  • 196
  • My favorite by a bit. However, just to make sure everything stays the intended type when processing through `score.get` I'd use `itertuples` or `zip(*map(df.get, df))`... To reiterate, this is my preferred approach. – piRSquared Oct 29 '19 at 16:49
  • 1
    `df.assign(score=[score.get(t, 0) for t in zip(*map(df.get, df))])` – piRSquared Oct 29 '19 at 16:50
  • 1
    Lastly, most of what I'm writing is bluster because the hash of `1.0` is the same as the hash for `1` therefore the tuple look ups should result in the same answer regardless. Apologies @Alexander for so many comments on this but I just want people to upvote this more because... they should (-: – piRSquared Oct 29 '19 at 16:52
  • 1
    As long as you're timing, look at my suggestion. There are occasions when `.values` is expensive – piRSquared Oct 29 '19 at 17:17
  • @piRSquared Yes, `[score.get(t, 0) for t in zip(*map(df.get, df))]` is better, however, `df.assign(score=[score.get(v, 0) for i, v in df.itertuples()])` doesn't work. Let me see if I can modify it. – Alexander Oct 29 '19 at 17:22
  • Yeah I messed that up... it should've been `df.assign(score=[score.get(v, 0) for v in df.itertuples(index=False, name=None)])` or `df.assign(score=[score.get(tuple(v), 0) for i, *v in df.itertuples()])` But `tuple(v)` is not needed if we pass the `index=False` flag. – piRSquared Oct 29 '19 at 17:25
  • @piRSquared: `zip(*map(df.get, df)` is nice :) – Andy L. Oct 29 '19 at 17:51
  • 1
    @AndyL. you can even control which columns and in which order: `zip(*map(df.get, ['col2', 'col1', 'col5']))` or get tuples of a modification of `df`: `zip(*map(df.eq(1).get, df))` – piRSquared Oct 29 '19 at 17:56
  • @piRSquared: wow, it's a great logic to zip on columns. I haven't known these nifty tricks. Thanks a lot, bro :) – Andy L. Oct 29 '19 at 18:05
4

You could use map, since score is a dictionary:

df['score'] = df[['gender', 'age', 'cholesterol', 'smoke']].apply(tuple, axis=1).map(score).fillna(0)
print(df)

Output

   gender  age  cholesterol  smoke  score
0       1   13            1      0    0.0
1       1   45            2      0    0.0
2       0    1            2      1    5.0
3       1   45            1      1    4.0
4       1   15            1      7    0.0
5       0   16            1      8    0.0
6       0   16            1      3    0.0
7       0   16            1      4    0.0
8       1   15            1      4    0.0
9       0   15            1      2    0.0

As an alternative you could use a list comprehension:

df['score'] = [score.get(t, 0) for t in zip(df.gender, df.age, df.cholesterol, df.smoke)]
print(df)
Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76
  • I'd like to extent my question. Really I need to add column base on range of column value. For example, if 40 < age < 50 then then score = 4 etc... Now dictionary maps on exact some value. Same true and for other keys.... – Roman Kazmin Oct 29 '19 at 16:44
  • 1
    Add an example of what you really want – Dani Mesejo Oct 29 '19 at 16:45
  • Simple example: # Here 40 and 50, 10 and 20 are age range for which I should use score = 4(or 5) score = {(1, 40, 50, 1, 1) : 4, (0, 10, 20, 1, 3) : 5} – Roman Kazmin Oct 29 '19 at 16:46
  • 1
    @Mikola You should let know every body, although at this point I believe is better if you ask another question. – Dani Mesejo Oct 29 '19 at 16:56
4

List comprehension and map:

df['score'] = (pd.Series(zip(df.gender, df.age, df.cholesterol, df.smoke))
               .map(score)
               .fillna(0)
               .astype(int)
              )

Output:

   gender  age  cholesterol  smoke  score
0       1   13            1      0      0
1       1   45            2      0      0
2       0    1            2      1      5
3       1   45            1      1      4
4       1   15            1      7      0
5       0   16            1      8      0
6       0   16            1      3      0
7       0   16            1      4      0
8       1   15            1      4      0
9       0   15            1      2      0
9       0   15            1      2    0.0
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
4

reindex

df['socre']=pd.Series(score).reindex(pd.MultiIndex.from_frame(df),fill_value=0).values
df
Out[173]: 
   gender  age  cholesterol  smoke  socre
0       1   13            1      0      0
1       1   45            2      0      0
2       0    1            2      1      5
3       1   45            1      1      4
4       1   15            1      7      0
5       0   16            1      8      0
6       0   16            1      3      0
7       0   16            1      4      0
8       1   15            1      4      0
9       0   15            1      2      0

Or merge

s=pd.Series(score)
s.index.names=['gender','age','cholesterol','smoke']
df=df.merge(s.to_frame('score').reset_index(),how='left').fillna(0)
Out[166]: 
   gender  age  cholesterol  smoke  score
0       1   13            1      0    0.0
1       1   45            2      0    0.0
2       0    1            2      1    5.0
3       1   45            1      1    4.0
4       1   15            1      7    0.0
5       0   16            1      8    0.0
6       0   16            1      3    0.0
7       0   16            1      4    0.0
8       1   15            1      4    0.0
9       0   15            1      2    0.0
BENY
  • 317,841
  • 20
  • 164
  • 234
2

May be another way would be using .loc[]:

m=df.set_index(df.columns.tolist())
m.loc[list(score.keys())].assign(
           score=score.values()).reindex(m.index,fill_value=0).reset_index()

   gender  age  cholesterol  smoke  score
0       1   13            1      0      0
1       1   45            2      0      0
2       0    1            2      1      5
3       1   45            1      1      4
4       1   15            1      7      0
5       0   16            1      8      0
6       0   16            1      3      0
7       0   16            1      4      0
8       1   15            1      4      0
9       0   15            1      2      0
anky
  • 74,114
  • 11
  • 41
  • 70
2

Simple one line solution, Use get and tuple row-wise,

df['score'] = df.apply(lambda x: score.get(tuple(x), 0), axis=1)

Above solution is assuming there are no columns other than desired ones in order. If not, just use columns

cols = ['gender','age','cholesterol','smoke']
df['score'] = df[cols].apply(lambda x: score.get(tuple(x), 0), axis=1)
Vishnudev Krishnadas
  • 10,679
  • 2
  • 23
  • 55
  • Use of `score.get` is good. However, you should prefer a comprehension, in my opinion. See [@Alexander's](https://stackoverflow.com/a/58611923/2336654) timings. – piRSquared Oct 29 '19 at 17:04