3

I am working with a dataframe of Chess results like the following

    Opponent    Date    Time    Result
0   Hikaru  2020.03.02  01:22:54    1
1   Hikaru  2020.03.02  01:22:58    0.5
2   Hikaru  2020.03.03  01:18:17    1
3   Hikaru  2020.03.03  01:19:54    0
4   Hikaru  2020.03.03  01:19:45    1
5   Hikaru  2020.03.03  02:15:23    0.5
6   Anish   2020.03.03  02:21:25    0.5
7   Anish   2020.03.03  02:21:29    0
8   Anish   2020.03.04  15:45:12    1
9   Anish   2020.03.04  15:48:11    0.5
10  Anish   2020.03.04  16:05:01    0.5

Now I'd like to (1) group by Opponent, (2) group by date (within opponent), (3) tabulate the count of each of the Results, (4) give the sequence of Results obtained. The first 3 can be obtained with pd.crosstab, e.g. - a full example -

import pandas as pd

d = {'Opponent': ['Hikaru']*6 + ['Anish']*5,
     'Date': ['2020.03.02']*2 + ['2020.03.03']*6 + ['2020.03.04']*3,
    'Time': ['01:22:54', '01:22:58', '01:18:17', '01:19:54', '01:19:45', '02:15:23', '02:21:25', '02:21:29', '15:45:12', '15:48:11', '16:05:01'],
    'Result': ['1', '0.5', '1', '0', '1', '0.5', '0.5', '0', '1', '0.5', '0.5']}

df = pd.DataFrame(data = d)

pd.crosstab([df['Opponent'], df['Date']],
            df['Result'])

What I would like is the same output as this last pd.crosstab but with an added column showing the sequence of results (ordered by time) in that day's games between the two players, ordered by time. Ideally I'd like '1's as 'W', 0.5s as 'D', 0s as 'L' and a single long string in the column.

Desired output:


                    Result  0   0.5 1   result_seq
Opponent    Date                
Anish   2020.03.03  1   1   0   DL
        2020.03.04  0   2   1   WDD
Hikaru  2020.03.02  0   1   1   WD
        2020.03.03  1   1   2   WWLD

Please note that, in the original dataframe, it is NOT guaranteed that games/results are listed in time-order; and in the original dataframe, the datatype of every variable is str and I'd like to keep it that way in the final output (e.g. Results should remain as '1', '0', '0.5' strings, not '1.0', '0.5, '0.0', Dates should finally be strings; only the actual result counts can and will presumably be integers).


My thoughts: I thought of just ordering by time and then taking the column as a pandas Series. The problem is how to do this along with (i.e. after) the grouping by Opponent and Date.

Mobeus Zoom
  • 598
  • 5
  • 19

2 Answers2

2

If you have this df:

   Opponent        Date      Time  Result
0    Hikaru  2020.03.02  01:22:54     1.0
1    Hikaru  2020.03.02  01:22:58     0.5
2    Hikaru  2020.03.03  01:18:17     0.0
3    Hikaru  2020.03.03  01:19:45     1.0
4    Hikaru  2020.03.03  01:19:54     1.0
5    Hikaru  2020.03.03  02:15:23     0.5
6     Anish  2020.03.03  02:21:25     0.5
7     Anish  2020.03.03  02:21:29     0.0
8     Anish  2020.03.04  15:45:12     1.0
9     Anish  2020.03.04  15:48:11     0.5
10    Anish  2020.03.04  16:05:01     0.5

Then you can use .pivot_table() to obtain your result:

df_out = df.pivot_table(
    index=["Opponent", "Date"],
    columns="Result",
    aggfunc="size",
    fill_value=0,
).rename(columns={0.0: "0", 1.0: "1"})

df_out["result_seq"] = df.groupby(["Opponent", "Date"])["Result"].apply(
    lambda x: "".join({0: "L", 1: "W", 0.5: "D"}[v] for v in x)
)
print(df_out)

Prints:

Result               0  0.5  1 result_seq
Opponent Date                            
Anish    2020.03.03  1    1  0         DL
         2020.03.04  0    2  1        WDD
Hikaru   2020.03.02  0    1  1         WD
         2020.03.03  1    1  2       LWWD

EDIT: To sort values by time:

df["tmp"] = pd.to_datetime(df.Date + " " + df.Time)
df = df.sort_values(by="tmp").drop(columns="tmp")

df_out = df.pivot_table(
    index=["Opponent", "Date"],
    columns="Result",
    aggfunc="size",
    fill_value=0,
).rename(columns={0.0: "0", 1.0: "1"})

df_out["result_seq"] = df.groupby(["Opponent", "Date"])["Result"].apply(
    lambda x: "".join({0: "L", 1: "W", 0.5: "D"}[v] for v in x)
)
print(df_out)

Prints:

Result               0  0.5  1 result_seq
Opponent Date                            
Anish    2020.03.03  1    1  0         DL
         2020.03.04  0    2  1        WDD
Hikaru   2020.03.02  0    1  1         WD
         2020.03.03  1    1  2       WWLD
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • This answer doesn't provide the sequence ordered by time. (It just orders by W, then D, then L.) The last result_seq is wrong: you have `WWDL` but should have `LWWD`, see my example. (Why `LWWD`? Because the results against Hikaru on 2020.03.03 were 0, 1, 1, 0.5 in that order.) – Mobeus Zoom Apr 08 '21 at 22:16
  • As I explained, there is no intrinsic ordering; the sequence is ordering of results by *time*. First is `DL` because against Anish on 2020.03.03 the results were 0.5, 0 in that order – Mobeus Zoom Apr 08 '21 at 22:20
  • Great your solution solves the original df. However, I also stated that the df may not necessarily come already ordered by time, and your answer doesn't explicitly order anywhere by time. I've updated the df in the question to clearly reflect this additional difficulty (notice the results against Hikaru on 2020.03.03 now have rows for the 01:19:54 game and 01:19:45 game swapped) – Mobeus Zoom Apr 08 '21 at 22:45
  • 1
    Excellent, fulfils all requirements :) – Mobeus Zoom Apr 08 '21 at 22:57
0

You are on the right track. Just rename columns wit desired labels:

df1=pd.crosstab([df['Opponent'], df['Date']],
            df['Result']).reset_index().rename(columns={1.0:'W',0.5:'D',0.0:'L'})

Create new column in which you concat a slice of the relevant columns based on their column value

df1['result_seq'] = df1.iloc[:,2:].mul(df1.iloc[:,2:].columns.values).sum(axis=1)



Result Opponent        Date  L  D  W result_seq
0         Anish  2020.03.03  1  1  0         LD
1         Anish  2020.03.04  0  2  1        DDW
2        Hikaru  2020.03.02  0  1  1         DW
3        Hikaru  2020.03.03  1  1  2       LDWW
wwnde
  • 26,119
  • 6
  • 18
  • 32
  • I get error `UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype(' dtype(' – Mobeus Zoom Apr 08 '21 at 22:47
  • I suspect, its quite likely you have more columns or your columns are not arranged as mine so slice doesnt work. Lets try `df1['result_seq'] = df1[['L','D','W']].mul(df1[['L','D','W']].columns.values).sum(axis=1)` – wwnde Apr 08 '21 at 22:51
  • Same error, sorry. Nope, no more columns. Just the exact `df` I provided in the question (with code). – Mobeus Zoom Apr 08 '21 at 22:56
  • Ahaaa, what version are you running because I cant preproduce the error? – wwnde Apr 08 '21 at 22:58