
I am a newbie in Python. I have a huge DataFrame with millions of rows and IDs. My data looks like this:

Time    ID  X   Y
8:00    A   23  100
9:00    B   24  110
10:00   B   25  120
11:00   C   26  130
12:00   C   27  140
13:00   A   28  150
14:00   A   29  160
15:00   D   30  170
16:00   C   31  180
17:00   B   32  190
18:00   A   33  200
19:00   C   34  210
20:00   A   35  220
21:00   B   36  230
22:00   C   37  240
23:00   B   38  250

I want to sort the data by ID and time. The expected result I am looking for is this:

Time    ID  X   Y
8:00    A   23  100
13:00   A   28  150
14:00   A   29  160
18:00   A   33  200
20:00   A   35  220
9:00    B   24  110
10:00   B   25  120
17:00   B   32  190
21:00   B   36  230
23:00   B   38  250
11:00   C   26  130
12:00   C   27  140
16:00   C   31  180
19:00   C   34  210
22:00   C   37  240
15:00   D   30  170

and I want to keep only the first and the last row of each ID and eliminate the rest. The expected result looks like this:

Time    ID  X   Y
8:00    A   23  100
20:00   A   35  220
9:00    B   24  110
23:00   B   38  250
11:00   C   26  130
22:00   C   37  240
15:00   D   30  170

How can I do this in pandas? Thank you for your advice.
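
For reference, here is roughly how the sample above can be rebuilt (a sketch; I am keeping Time as plain strings, which already happen to be in chronological order):

import pandas as pd

# rebuild the sample frame shown above
df = pd.DataFrame({
    'Time': ['8:00', '9:00', '10:00', '11:00', '12:00', '13:00', '14:00', '15:00',
             '16:00', '17:00', '18:00', '19:00', '20:00', '21:00', '22:00', '23:00'],
    'ID':   ['A', 'B', 'B', 'C', 'C', 'A', 'A', 'D',
             'C', 'B', 'A', 'C', 'A', 'B', 'C', 'B'],
    'X':    list(range(23, 39)),           # 23, 24, ..., 38
    'Y':    list(range(100, 260, 10)),     # 100, 110, ..., 250
})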

Arief Hidayat

3 Answers


Use groupby, find the head and tail for each group, and concat the two.

g = df.groupby('ID')

(pd.concat([g.head(1), g.tail(1)])
   .drop_duplicates()
   .sort_values('ID')
   .reset_index(drop=True))

    Time ID   X    Y
0   8:00  A  23  100
1  20:00  A  35  220
2   9:00  B  24  110
3  23:00  B  38  250
4  11:00  C  26  130
5  22:00  C  37  240
6  15:00  D  30  170

If you can guarantee each ID group has at least two rows, the drop_duplicates call is not needed.
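
As an aside, `GroupBy.nth` also accepts a list of positions, so the first and last row of each group can be picked in one call; in my experience a single-row group comes back only once this way, although the index/column layout returned by `nth` has changed between pandas versions, so treat this as a sketch rather than a drop-in replacement:

# positions 0 and -1 within each ID group; a one-row group such as D should appear only once
df.groupby('ID').nth([0, -1])

Depending on your pandas version you may still need a final sort to get the groups blocked together as in the expected output.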


Details

g.head(1)

    Time ID   X    Y
0   8:00  A  23  100
1   9:00  B  24  110
3  11:00  C  26  130
7  15:00  D  30  170

g.tail(1)

     Time ID   X    Y
7   15:00  D  30  170
12  20:00  A  35  220
14  22:00  C  37  240
15  23:00  B  38  250

pd.concat([g.head(1), g.tail(1)])

     Time ID   X    Y
0    8:00  A  23  100
1    9:00  B  24  110
3   11:00  C  26  130
7   15:00  D  30  170
7   15:00  D  30  170
12  20:00  A  35  220
14  22:00  C  37  240
15  23:00  B  38  250
cs95
  • Impressive... would you mind explaining what `pd.concat` is doing here alongside `groupby`? – Arief Hidayat Dec 26 '18 at 04:49
  • @Arief `g.head(1)` and `g.tail(1)` return two separate DataFrames; I then use `pd.concat` to join the two together, vertically. `concat` and `groupby` are independent operations. – cs95 Dec 26 '18 at 04:50
  • Thank you so much for the detailed explanation. How do I save the result to a new CSV? When I tried `to.csv` I got an error. – Arief Hidayat Dec 26 '18 at 04:54
  • @Arief `result.to_csv('file.csv')`. Remember to accept the answer if it was helpful, thanks :-) – cs95 Dec 26 '18 at 04:55
  • How do I sort by both time and ID? When I tried this on my data, I got confused by the time values. – Arief Hidayat Dec 26 '18 at 05:04
  • @Arief If your input data is already sorted by time, then the output will just need to be sorted by ID. I've done that for you already, can you please check? If you have any more follow up questions, I'd request you to open a new question, it's easier than having users iteratively debug your questions. – cs95 Dec 26 '18 at 05:12
  • Of course, I am sorry I forgot to accept it. Thank you again. – Arief Hidayat Dec 27 '18 at 13:07
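
A note on the time/ID sorting question from the comments above: if the input is not already ordered by time, one option is to parse Time and sort on both columns before taking the head and tail of each group. A sketch, assuming times formatted like '8:00' (the tmp and Time_parsed names are just for illustration):

# parse the Time strings so they sort chronologically rather than lexically
tmp = df.assign(Time_parsed=pd.to_datetime(df['Time'], format='%H:%M'))
tmp = tmp.sort_values(['ID', 'Time_parsed'])

g = tmp.groupby('ID')
out = (pd.concat([g.head(1), g.tail(1)])
         .drop_duplicates()
         .sort_values(['ID', 'Time_parsed'])
         .drop(columns='Time_parsed')
         .reset_index(drop=True))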

If you create a small function to only select the first and last rows of a DataFrame, you can apply this to a group-by, like so:

df.groupby('ID').apply(lambda x: x.iloc[[0, -1]]).reset_index(drop=True)

As others have mentioned, it might be nice to also .drop_duplicates() or similar after the fact, to filter out duplicated rows for cases where there was only one row for the 'ID'.
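
For example, the deduplication could be chained on directly (a sketch; drop_duplicates compares all columns by default, so an ID that only had one row survives exactly once):

(df.groupby('ID')
   .apply(lambda x: x.iloc[[0, -1]])
   .drop_duplicates()
   .reset_index(drop=True))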

johnnybarrels
  • I like your solution, however there is a typo in the formula (df vs. x), and I suggest adding reset_index to avoid a duplicated index column in the result: df.groupby('ID').apply(lambda x: x.iloc[[0, -1]]).reset_index(drop=True) – bpelhos Mar 01 '22 at 11:20
  • @bpelhos thanks for that, original answer updated! – johnnybarrels May 18 '22 at 02:41

You can also get the rows you want by first sorting the DataFrame on the 'ID' column:

df_sorted = df.sort_values("ID")

After that, make an empty DataFrame, result, with the same columns by filtering on a blank ID:

result = df[df.ID == '']   # empty frame with the same columns as df

Store all the unique values present in the 'ID' column:

uni = list(df.ID.unique())

Then, finally, append each ID's first and last row (its head and tail) to that empty DataFrame.

Final code will look something like this:

df_sorted = df.sort_values("ID")
result = df[df.ID == '']               # empty DataFrame with the same columns
uni = list(df.ID.unique())
for x in uni:
    # DataFrame.append was removed in pandas 2.0, so stack the rows with pd.concat instead
    first = df_sorted[df_sorted.ID == x].head(1)
    last = df_sorted[df_sorted.ID == x].tail(1)
    result = pd.concat([result, first, last], ignore_index=True)
# an ID with a single row (like D) is added twice above, so drop the duplicate copies
result = result.drop_duplicates(ignore_index=True)
result
Akash Kumar