1

Text can be used as input to pandas dataframes to make easily reproducible examples for testing solutions.1

import pandas as pd
from io import StringIO

txt= """
ID,datetime,value
AB-CL-34,07/10/2022 10:00:00,5 
AB-CL-34,07/10/2022 11:15:10,7 
AB-CL-34,09/10/2022 15:30:30,13 
BX-RT-55,06/10/2022 11:30:22,0 
BX-RT-55,10/10/2022 22:44:11,1 
BX-RT-55,10/10/2022 23:30:22,6 
"""

df = pd.read_csv(StringIO(txt), parse_dates=[1], dayfirst=True)

How do I index df such that there is an integer index according to each ID, keeping the ID variable?

imagined output

        ID            datetime value
0 AB-CL-34 07/10/2022 10:00:00     5 
0 AB-CL-34 07/10/2022 11:15:10     7 
0 AB-CL-34 09/10/2022 15:30:30    13 
1 BX-RT-55 06/10/2022 11:30:22     0 
1 BX-RT-55 10/10/2022 22:44:11     1 
1 BX-RT-55 10/10/2022 23:30:22     6 

edit ID values are now chr/str, not int, as they should have been (apologies to Marcus_CH.

Johan
  • 186
  • 15

1 Answers1

1

You could try:

df = pd.read_csv(StringIO(txt),\
             parse_dates=[1],\
             dayfirst=True)\
             .assign(id_index= lambda df: df\
             .groupby('ID', sort=False).ngroup())\
             .set_index("id_index")\
             .rename_axis(index=None)

Output

+----+----------+---------------------+---------+
|    |       ID |            datetime |   value |
|----+----------+---------------------+---------|
|  0 | AB-CL-34 | 2022-10-07 10:00:00 |       5 |
|  0 | AB-CL-34 | 2022-10-07 11:15:10 |       7 |
|  0 | AB-CL-34 | 2022-10-09 15:30:30 |      13 |
|  1 | BX-RT-55 | 2022-10-06 11:30:22 |       0 |
|  1 | BX-RT-55 | 2022-10-10 22:44:11 |       1 |
|  1 | BX-RT-55 | 2022-10-10 23:30:22 |       6 |
+----+----------+---------------------+---------+

References:

  1. indexing, please have a look at the official reference. This is a way better explanation as I could do.
  2. index=None to remove the index name (try it without this statement)
  3. lambda is an anonymous function. You could also do something like df["id_col"] = df["ID"] - 1. lambda makes it easier and faster.
  4. the brackets are so that you can do line breaks without \ for linebreak (so it's just for styling)
Johan
  • 186
  • 15
Marco_CH
  • 3,243
  • 8
  • 25
  • Thanks for replying, @Marco_CH, and so fast. This is close but not exactly the df I imagine. The dataframe I am comparing with has an additional col to the "left" of ID but without an explicit label. Please be welcome to take a closer look at the df I posted in the bottom of my question. – Johan Oct 15 '22 at 14:33
  • 1
    Thanks, haven't noticed that. Sorry... Answer is updated. – Marco_CH Oct 15 '22 at 14:38
  • Great! :-) Can you add some explanation to your solution, for instance at the bottom of your answer? I would like to better understand (1) how lambda works in this case (I guess x = df)? (2) if you should pick one alternative to using lambda code to get the same result, what would that be? (3) why did you surround the code in parentheses? – Johan Oct 15 '22 at 15:16
  • 1
    I added some more explanations, but especially for index and lambda I suggest to have a look at the references or watching some youtube tutorials. `lambda` is a very powerful function in python. High recommended to understand this concept, will help you a lot. – Marco_CH Oct 15 '22 at 15:20
  • 1
    Thanks for your attitude let alone your solution. I will be sure to research `lambda´. I believe this suggests that I raise another question specifically about indexing defaults/standards. – Johan Oct 15 '22 at 15:25
  • 1
    I changed the ID values to chr/str instead of int. I also figured out how to change your solution to solve this new problem. So I edited your answer. Thanks again. – Johan Oct 15 '22 at 17:56