3

I am trying to convert this dataframe:

data_in = {
    'dates': [2017, 2017, 2018, 2019, 2019, 2019],
    'names': ['Roger', 'Rafa', 'Roger', 'Rafa', 'Novak', 'Dom']
}
df_in = pd.DataFrame(data_in)
>>> df_in
    dates   names
0   2017    Roger
1   2017    Rafa
2   2018    Roger
3   2019    Rafa
4   2019    Novak
5   2019    Dom

into this binary matrix:

>>> df_out

    Roger   Rafa    Novak   Dom
dates               
2017    1   1   0   0
2018    1   0   0   0
2019    0   1   1   1

with all the dates as index, all the names as columns, and the data being 1 if occurence of the name at the date, and np.NaN or 0 if not.

I can build the df_out dataframe with its index and columns, but how would you get the data?

NatOuzo
  • 57
  • 4
  • Does this answer your question? [How to pivot pandas DataFrame column to create binary "value table"?](https://stackoverflow.com/questions/39812275/how-to-pivot-pandas-dataframe-column-to-create-binary-value-table) – LoukasPap Jan 07 '21 at 11:53
  • What do you mean update the data? Also what should happen if you have 2017, roger 2 times? – Dani Mesejo Jan 07 '21 at 11:58
  • @L.Papadopoulos – not really, as I want here to convert a column with duplicates to an index with unique values. – NatOuzo Jan 07 '21 at 14:20
  • @DaniMesejo – ''update'' the data has been reformulated. And each couple (dates, names) is unique. – NatOuzo Jan 07 '21 at 14:23

2 Answers2

5

Use get_dummies with max for 1 if exist values else 0:

df = pd.get_dummies(df_in.set_index('dates')['names']).groupby(level=0).max()
print (df)
       Dom  Novak  Rafa  Roger
dates                         
2017     0      0     1      1
2018     0      0     0      1
2019     1      1     1      0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
4

Use crosstab:

res = pd.crosstab(df_in['dates'], df_in['names'])
print(res)

Output

names  Dom  Novak  Rafa  Roger
dates                         
2017     0      0     1      1
2018     0      0     0      1
2019     1      1     1      0

If only need indicator function, do:

res = pd.crosstab(df_in['dates'], df_in['names']).gt(0).astype(int)
print(res)

Output

names  Dom  Novak  Rafa  Roger
dates                         
2017     0      0     1      1
2018     0      0     0      1
2019     1      1     1      0
Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76