1

Is it possible to return the entirety of data not just part of which we are grouping by?

I mean for example - I have a dataframe with 5 columns and one of those columns contains distance, the other one is timestamp and the last important one is name. I grouped dataframe by timestamp - agg function I applied is (min) on distance. As a return i get correctly grouped dataframe with timestamp and distance - how can i add columns name there. If I group it by name as well then timestamp becomes duplicated - it has to stay unique. As a final result I need to get dataframe like this:

timestamp name distance
2020-03-03 15:30:235 Billy 123
2020-03-03 15:30:435 Johny 111

But instead i get this:

timestamp distance
2020-03-03 15:30:235 123
2020-03-03 15:30:435 111

Whole table has more than 700k rows so joining it back on distance gives me that amount of rows which my PC can't even handle.

Here is my groupby which gives me 2nd table:

grouped_df = df1.groupby('timestamp')['distance'].min()

Here is what i tried to do in order to get name inside the table:

grouped_df.merge(df1, how='left', left_on=['timestamp','distance'],    
right_on = ['timestamp','distance'])

Henry Ecker
  • 34,399
  • 18
  • 41
  • 57

2 Answers2

1

Just try

out = df.sort_values('distance').drop_duplicates('timestamp')

Then try with transform

m = df.groupby('timestamp')['distance'].transform('min')
dout = df[df.distance==m]
BENY
  • 317,841
  • 20
  • 164
  • 234
0

You can use GroupBy.agg method to apply min on the distance column, and apply a relevant function on name column (lambda x:x to simply return its data). This will return the dataframe with both columns back to you:

grouped_df = df1.groupby('timestamp').agg({'distance':'min','name':lambda x:x})

Also if you want timestamp back as a column and not as index, you can set as_index=False during groupby.

Output:

>>> grouped_df = df1.groupby('timestamp', as_index=False).agg({'distance':'min','name':lambda x:x})
>>> grouped_df
              timestamp  distance   name
0  2020-03-03 15:30:235       123  Billy
1  2020-03-03 15:30:435       111  Johny
Ank
  • 1,704
  • 9
  • 14
  • Hi, thanks. Thought it doesn't return entire dataframe, column name is still not returned. It gives only timestamp and distance but doesnt show the person who did the distance. – Aleksander Kuś Jun 13 '21 at 15:24
  • ohh Yes, posted code too early, you can apply a relevant function on name column as well. Edited the answer, try now. – Ank Jun 13 '21 at 15:36