
My problem builds on this question:

Is there a faster way to update dataframe column values based on conditions?

The data is set up as follows:

import pandas as pd
import io
t="""
AV4MdG6Ihowv-SKBN_nB    DTP,FOOD
AV4Mc2vNhowv-SKBN_Rn    Cash 1,FOOD
AV4MeisikOpWpLdepWy6    DTP,Bar
AV4MeRh6howv-SKBOBOn    Cash 1,FOOD
AV4Mezwchowv-SKBOB_S    DTOT,Bar
AV4MeB7yhowv-SKBOA5b    DTP,Bar
"""
# a raw string avoids the invalid-escape warning for \s,
# and a regex separator requires engine='python'
data_vec = pd.read_csv(io.StringIO(t), sep=r'\s{2,}',
                       engine='python', names=['id', 'source'])
data_vec

This is the data_vec:

    id  source
0   AV4MdG6Ihowv-SKBN_nB    DTP,FOOD
1   AV4Mc2vNhowv-SKBN_Rn    Cash 1,FOOD
2   AV4MeisikOpWpLdepWy6    DTP,Bar
3   AV4MeRh6howv-SKBOBOn    Cash 1,FOOD
4   AV4Mezwchowv-SKBOB_S    DTOT,Bar
5   AV4MeB7yhowv-SKBOA5b    DTP,Bar

I want the result to look like the following (in other words: how do I vectorize multiple tags or categories per row?):

                  _id  source_Cash 1  source_DTOT  source_DTP  Food  Bar
0  AV4MdG6Ihowv-SKBN_nB              0            0        1      1    0
1  AV4Mc2vNhowv-SKBN_Rn              1            0        0      1    0
2  AV4MeisikOpWpLdepWy6              0            0        1      0    1
3  AV4MeRh6howv-SKBOBOn              1            0        0      1    0
4  AV4Mezwchowv-SKBOB_S              0            1        0      0    1
5  AV4MeB7yhowv-SKBOA5b              0            0        1      0    1

If this is a duplicate, please let me know and I will delete it!

  • probably a duplicate, but I don't know which one. Anyway, `get_dummies` takes a separator, which will deal with multiple values: `data_vec.source.str.get_dummies(sep=',')` – JohnE Oct 11 '17 at 03:50
  • 1
    @JohnE Thanks for your comment, I learned something new! Didn't know about the separator. – cs95 Oct 11 '17 at 04:00
  • Me too! @JohnE @COLDSPEED – ileadall42 Oct 11 '17 at 04:01
  • @OP, it would be nice if you could close [this question](https://stackoverflow.com/questions/46678400/is-there-a-faster-way-to-update-dataframe-column-values-based-on-conditions) as well, and accept an answer. – cs95 Oct 11 '17 at 04:01
  • It's not my question; if it were, I would close it! – ileadall42 Oct 11 '17 at 04:04
  • @Tangfeifan I thought you were asking the question on behalf of your colleague or something. – cs95 Oct 11 '17 at 04:08
  • @COLDSPEED Haha, I just find that Stack Overflow is a good place to learn, so I have been active here for a few days, and I have gotten to know a lot of people in the pandas tag, such as you! A very nice and excellent person! – ileadall42 Oct 11 '17 at 04:10
  • Well, thank you for asking the question! Hopefully the original OP might get some closure too. Hope to see you around asking more questions. – cs95 Oct 11 '17 at 04:17
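JohnE's one-liner from the comments can be run end-to-end as a self-contained sketch (rebuilding `data_vec` from the question's data; the `result` name and the `source_` prefix are illustrative choices, not from the comment):

```python
import io
import pandas as pd

t = """
AV4MdG6Ihowv-SKBN_nB    DTP,FOOD
AV4Mc2vNhowv-SKBN_Rn    Cash 1,FOOD
AV4MeisikOpWpLdepWy6    DTP,Bar
AV4MeRh6howv-SKBOBOn    Cash 1,FOOD
AV4Mezwchowv-SKBOB_S    DTOT,Bar
AV4MeB7yhowv-SKBOA5b    DTP,Bar
"""
data_vec = pd.read_csv(io.StringIO(t), sep=r'\s{2,}',
                       engine='python', names=['id', 'source'])

# str.get_dummies splits each cell on sep and one-hot encodes every tag
dummies = data_vec['source'].str.get_dummies(sep=',')
result = pd.concat([data_vec[['id']], dummies.add_prefix('source_')], axis=1)
print(result)
```

Because `str.get_dummies` handles the splitting itself, no explicit `str.split` or row expansion is needed.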

2 Answers


A bit of `str.split` and `get_dummies` magic, inspired by Scott Boston and improved (from the original version) thanks to JohnE.

df = data_vec.set_index('id').source.str.get_dummies(',')
df.columns = df.columns.str.split().str[0].str.lower()
df = df.add_prefix('source_').reset_index()

print(df)
                     id  source_bar  source_cash  source_dtot  source_dtp  \
0  AV4MdG6Ihowv-SKBN_nB           0            0            0           1   
1  AV4Mc2vNhowv-SKBN_Rn           0            1            0           0   
2  AV4MeisikOpWpLdepWy6           1            0            0           1   
3  AV4MeRh6howv-SKBOBOn           0            1            0           0   
4  AV4Mezwchowv-SKBOB_S           1            0            1           0   
5  AV4MeB7yhowv-SKBOA5b           1            0            0           1   

   source_food  
0            1  
1            1  
2            0  
3            1  
4            0  
5            0  

You could also do this: split the `source` column to create one new row per tag, then call `get_dummies` on the tag column, and finally group by the `id` column.

data_vec = pd.DataFrame(pd.concat([pd.Series(row['id'], row['source'].split(','))
                                   for _, row in data_vec.iterrows()])).reset_index()
data_vec.columns = ['source', 'id']

which gives:

    source           id
0   DTP     AV4MdG6Ihowv-SKBN_nB
1   FOOD    AV4MdG6Ihowv-SKBN_nB
2   Cash 1  AV4Mc2vNhowv-SKBN_Rn
3   FOOD    AV4Mc2vNhowv-SKBN_Rn
4   DTP     AV4MeisikOpWpLdepWy6
5   Bar     AV4MeisikOpWpLdepWy6
6   Cash 1  AV4MeRh6howv-SKBOBOn
7   FOOD    AV4MeRh6howv-SKBOBOn
8   DTOT    AV4Mezwchowv-SKBOB_S
9   Bar     AV4Mezwchowv-SKBOB_S
10  DTP     AV4MeB7yhowv-SKBOA5b
11  Bar     AV4MeB7yhowv-SKBOA5b

Then call `get_dummies()` on the `source` column:

result = pd.concat([data_vec.get(['id']),
                           pd.get_dummies(data_vec['source'], prefix='source')],axis=1)

result.groupby('id').sum().reset_index()

Which gives:

                     id  source_Bar  source_Cash 1  source_DTOT  source_DTP  source_FOOD
0  AV4Mc2vNhowv-SKBN_Rn           0              1            0           0            1
1  AV4MdG6Ihowv-SKBN_nB           0              0            0           1            1
2  AV4MeB7yhowv-SKBOA5b           1              0            0           1            0
3  AV4MeRh6howv-SKBOBOn           0              1            0           0            1
4  AV4MeisikOpWpLdepWy6           1              0            0           1            0
5  AV4Mezwchowv-SKBOB_S           1              0            1           0            0
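As a side note (not part of the original answer), on pandas 0.25 or newer the `iterrows` row-expansion step above can be replaced with `DataFrame.explode`; a self-contained sketch, rebuilding `data_vec` from the question:

```python
import io
import pandas as pd

t = """
AV4MdG6Ihowv-SKBN_nB    DTP,FOOD
AV4Mc2vNhowv-SKBN_Rn    Cash 1,FOOD
AV4MeisikOpWpLdepWy6    DTP,Bar
AV4MeRh6howv-SKBOBOn    Cash 1,FOOD
AV4Mezwchowv-SKBOB_S    DTOT,Bar
AV4MeB7yhowv-SKBOA5b    DTP,Bar
"""
data_vec = pd.read_csv(io.StringIO(t), sep=r'\s{2,}',
                       engine='python', names=['id', 'source'])

# turn the comma-joined tags into lists, then emit one row per tag
long_form = (data_vec.assign(source=data_vec['source'].str.split(','))
                     .explode('source'))
# one-hot encode the tag column, then collapse back to one row per id
result = (pd.get_dummies(long_form, columns=['source'])
            .groupby('id', as_index=False).sum())
print(result)
```

This produces the same one-row-per-`id` indicator table without a Python-level loop.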
  • If you're going this way, you could just split on comma and stack. However, [just calling `pd.get_dummies` directly](https://stackoverflow.com/a/46679500/4909087) without doing all this would be much simpler, this is unnecessarily convoluted. – cs95 Oct 11 '17 at 04:16
  • I agree. I learned something new today: that we can pass a separator to `get_dummies()`. – Gayatri Oct 11 '17 at 04:23
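The split-and-stack route cs95 mentions in the comment above could look like the following sketch (my own rendering, not from either answer; `tags` and `result` are illustrative names):

```python
import io
import pandas as pd

t = """
AV4MdG6Ihowv-SKBN_nB    DTP,FOOD
AV4Mc2vNhowv-SKBN_Rn    Cash 1,FOOD
AV4MeisikOpWpLdepWy6    DTP,Bar
AV4MeRh6howv-SKBOBOn    Cash 1,FOOD
AV4Mezwchowv-SKBOB_S    DTOT,Bar
AV4MeB7yhowv-SKBOA5b    DTP,Bar
"""
data_vec = pd.read_csv(io.StringIO(t), sep=r'\s{2,}',
                       engine='python', names=['id', 'source'])

# split on the comma and stack: one (id, tag) pair per row
tags = (data_vec.set_index('id')['source']
                .str.split(',', expand=True)
                .stack()
                .reset_index(level=1, drop=True))
# one-hot encode, then take the max per id to get back one row per id
result = (pd.get_dummies(tags, prefix='source')
            .groupby(level=0).max()
            .reset_index())
print(result)
```

Taking `max` rather than `sum` keeps the columns strictly 0/1 even if a tag were repeated within one cell.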