get index per ID - from example data using text input to pandas dataframe? easily reproducible

Question

Text can be used as input to pandas dataframes to make easily reproducible examples for testing solutions.1

import pandas as pd
from io import StringIO

txt= """
ID,datetime,value
AB-CL-34,07/10/2022 10:00:00,5 
AB-CL-34,07/10/2022 11:15:10,7 
AB-CL-34,09/10/2022 15:30:30,13 
BX-RT-55,06/10/2022 11:30:22,0 
BX-RT-55,10/10/2022 22:44:11,1 
BX-RT-55,10/10/2022 23:30:22,6 
"""

df = pd.read_csv(StringIO(txt), parse_dates=[1], dayfirst=True)

How do I index df such that there is an integer index according to each ID, keeping the ID variable?

imagined output

        ID            datetime value
0 AB-CL-34 07/10/2022 10:00:00     5 
0 AB-CL-34 07/10/2022 11:15:10     7 
0 AB-CL-34 09/10/2022 15:30:30    13 
1 BX-RT-55 06/10/2022 11:30:22     0 
1 BX-RT-55 10/10/2022 22:44:11     1 
1 BX-RT-55 10/10/2022 23:30:22     6

edit ID values are now chr/str, not int, as they should have been (apologies to Marcus_CH.

score 1 · Accepted Answer · edited Oct 16 '22 at 15:48

1

You could try:

df = pd.read_csv(StringIO(txt),\
             parse_dates=[1],\
             dayfirst=True)\
             .assign(id_index= lambda df: df\
             .groupby('ID', sort=False).ngroup())\
             .set_index("id_index")\
             .rename_axis(index=None)

Output

+----+----------+---------------------+---------+
|    |       ID |            datetime |   value |
|----+----------+---------------------+---------|
|  0 | AB-CL-34 | 2022-10-07 10:00:00 |       5 |
|  0 | AB-CL-34 | 2022-10-07 11:15:10 |       7 |
|  0 | AB-CL-34 | 2022-10-09 15:30:30 |      13 |
|  1 | BX-RT-55 | 2022-10-06 11:30:22 |       0 |
|  1 | BX-RT-55 | 2022-10-10 22:44:11 |       1 |
|  1 | BX-RT-55 | 2022-10-10 23:30:22 |       6 |
+----+----------+---------------------+---------+

References:

indexing, please have a look at the official reference. This is a way better explanation as I could do.
index=None to remove the index name (try it without this statement)
lambda is an anonymous function. You could also do something like df["id_col"] = df["ID"] - 1. lambda makes it easier and faster.
the brackets are so that you can do line breaks without \ for linebreak (so it's just for styling)

edited Oct 16 '22 at 15:48

Johan

186
15

answered Oct 15 '22 at 14:21

Marco_CH

3,243
8
25

Thanks for replying, @Marco_CH, and so fast. This is close but not exactly the df I imagine. The dataframe I am comparing with has an additional col to the "left" of ID but without an explicit label. Please be welcome to take a closer look at the df I posted in the bottom of my question. – Johan Oct 15 '22 at 14:33
1

Thanks, haven't noticed that. Sorry... Answer is updated. – Marco_CH Oct 15 '22 at 14:38
Great! :-) Can you add some explanation to your solution, for instance at the bottom of your answer? I would like to better understand (1) how lambda works in this case (I guess x = df)? (2) if you should pick one alternative to using lambda code to get the same result, what would that be? (3) why did you surround the code in parentheses? – Johan Oct 15 '22 at 15:16
1

I added some more explanations, but especially for index and lambda I suggest to have a look at the references or watching some youtube tutorials. `lambda` is a very powerful function in python. High recommended to understand this concept, will help you a lot. – Marco_CH Oct 15 '22 at 15:20
1

Thanks for your attitude let alone your solution. I will be sure to research `lambda´. I believe this suggests that I raise another question specifically about indexing defaults/standards. – Johan Oct 15 '22 at 15:25
1

I changed the ID values to chr/str instead of int. I also figured out how to change your solution to solve this new problem. So I edited your answer. Thanks again. – Johan Oct 15 '22 at 17:56

get index per ID - from example data using text input to pandas dataframe? easily reproducible

1 Answers1