I am concatenating 2 pandas dataframes in Python 3.11.1 and then wanting to sort the result, however, I am getting unexpected results.
Here are the data:
df40
CheckDate CheckDay CheckTime ContestDate ContestDay ContestTime id
0 2023-07-15 Saturday 14:19 2023-07-15 Saturday 19:05:00 146731350
1 2023-07-15 Saturday 14:01 2023-07-15 Saturday 19:05:00 146731350
2 2023-07-15 Saturday 14:03 2023-07-15 Saturday 19:05:00 146731350
3 2023-07-15 Saturday 14:19 2023-07-15 Saturday 19:05:00 146731371
4 2023-07-15 Saturday 14:01 2023-07-15 Saturday 19:05:00 146731371
5 2023-07-15 Saturday 14:03 2023-07-15 Saturday 19:05:00 146731371
6 2023-07-15 Saturday 14:19 2023-07-15 Saturday 19:05:00 146731373
7 2023-07-15 Saturday 14:01 2023-07-15 Saturday 19:05:00 146731373
8 2023-07-15 Saturday 14:03 2023-07-15 Saturday 19:05:00 146731373
df25
CheckDate CheckDay CheckTime ContestDate ContestDay ContestTime id
0 2023-07-15 Saturday 14:35 2023-07-15 Saturday 19:05:00 146731350
1 2023-07-15 Saturday 14:35 2023-07-15 Saturday 19:05:00 146731371
2 2023-07-15 Saturday 14:35 2023-07-15 Saturday 19:05:00 146731373
Here is my code:
df40 = pd.read_csv(fn02)
df41 = pd.concat([df40,df25])
df41.reset_index(drop=True, inplace=True)
df41.sort_values(by=['id', 'CheckDate', 'CheckTime'], inplace=True)
And this is what I get:
df41
CheckDate CheckDay CheckTime ContestDate ContestDay ContestTime id
0 2023-07-15 Saturday 14:35 2023-07-15 Saturday 19:05:00 146731350
1 2023-07-15 Saturday 14:01 2023-07-15 Saturday 19:05:00 146731350
2 2023-07-15 Saturday 14:03 2023-07-15 Saturday 19:05:00 146731350
3 2023-07-15 Saturday 14:19 2023-07-15 Saturday 19:05:00 146731350
4 2023-07-15 Saturday 14:35 2023-07-15 Saturday 19:05:00 146731371
5 2023-07-15 Saturday 14:01 2023-07-15 Saturday 19:05:00 146731371
6 2023-07-15 Saturday 14:03 2023-07-15 Saturday 19:05:00 146731371
7 2023-07-15 Saturday 14:19 2023-07-15 Saturday 19:05:00 146731371
8 2023-07-15 Saturday 14:35 2023-07-15 Saturday 19:05:00 146731373
9 2023-07-15 Saturday 14:01 2023-07-15 Saturday 19:05:00 146731373
10 2023-07-15 Saturday 14:03 2023-07-15 Saturday 19:05:00 146731373
11 2023-07-15 Saturday 14:19 2023-07-15 Saturday 19:05:00 146731373
It appears that the df40 data is properly sorted by id and time (if you look closely, you will notice that the CheckTime = 14:19 obs are not properly sorted in df40 to begin with (this is an iterative process, so the CheckTime =14:19 obs being out of sort is a result of the same issue from the previous iteration...)), however the df25 data is only sorted by id - all of the CheckTime = 14:35 obs are not properly sorted (the CheckTime = 14:35 obs appear at the top instead of the bottom of each id group).
Any assistance is appreciated.
EDIT 1:
===========================================
Here is my console output AFTER running the code supplied above.
>>> df41 = pd.concat([df40,df25])
>>> df41.reset_index(drop=True, inplace=True)
>>> df41.sort_values(by=['id', 'CheckDate', 'CheckTime'], inplace=True)
>>> df41
CheckDate CheckDay CheckTime ContestDate ContestDay ContestTime s id nt m delta_nt
9 2023-07-15 Saturday 16:00 2023-07-15 Saturday 19:05:00 2 146731350 51 114 NaN
1 2023-07-15 Saturday 14:01 2023-07-15 Saturday 19:05:00 2 146731350 47 114 -9612.0
2 2023-07-15 Saturday 14:03 2023-07-15 Saturday 19:05:00 2 146731350 47 114 NaN
0 2023-07-15 Saturday 14:19 2023-07-15 Saturday 19:05:00 2 146731350 48 114 NaN
10 2023-07-15 Saturday 16:00 2023-07-15 Saturday 19:05:00 2 146731371 63 113 NaN
4 2023-07-15 Saturday 14:01 2023-07-15 Saturday 19:05:00 2 146731371 55 113 6402.0
5 2023-07-15 Saturday 14:03 2023-07-15 Saturday 19:05:00 2 146731371 55 113 -39.0
3 2023-07-15 Saturday 14:19 2023-07-15 Saturday 19:05:00 2 146731371 55 113 NaN
11 2023-07-15 Saturday 16:00 2023-07-15 Saturday 19:05:00 2 146731373 108 183 NaN
7 2023-07-15 Saturday 14:01 2023-07-15 Saturday 19:05:00 2 146731373 89 183 -24453.0
8 2023-07-15 Saturday 14:03 2023-07-15 Saturday 19:05:00 2 146731373 89 183 18146.0
6 2023-07-15 Saturday 14:19 2023-07-15 Saturday 19:05:00 2 146731373 91 183 NaN
>>> df41.sort_values(by=['id', 'CheckDate', 'CheckTime'], inplace=True)
>>> df41
CheckDate CheckDay CheckTime ContestDate ContestDay ContestTime s id nt m delta_nt
9 2023-07-15 Saturday 16:00 2023-07-15 Saturday 19:05:00 2 146731350 51 114 NaN
1 2023-07-15 Saturday 14:01 2023-07-15 Saturday 19:05:00 2 146731350 47 114 -9612.0
2 2023-07-15 Saturday 14:03 2023-07-15 Saturday 19:05:00 2 146731350 47 114 NaN
0 2023-07-15 Saturday 14:19 2023-07-15 Saturday 19:05:00 2 146731350 48 114 NaN
10 2023-07-15 Saturday 16:00 2023-07-15 Saturday 19:05:00 2 146731371 63 113 NaN
4 2023-07-15 Saturday 14:01 2023-07-15 Saturday 19:05:00 2 146731371 55 113 6402.0
5 2023-07-15 Saturday 14:03 2023-07-15 Saturday 19:05:00 2 146731371 55 113 -39.0
3 2023-07-15 Saturday 14:19 2023-07-15 Saturday 19:05:00 2 146731371 55 113 NaN
11 2023-07-15 Saturday 16:00 2023-07-15 Saturday 19:05:00 2 146731373 108 183 NaN
7 2023-07-15 Saturday 14:01 2023-07-15 Saturday 19:05:00 2 146731373 89 183 -24453.0
8 2023-07-15 Saturday 14:03 2023-07-15 Saturday 19:05:00 2 146731373 89 183 18146.0
6 2023-07-15 Saturday 14:19 2023-07-15 Saturday 19:05:00 2 146731373 91 183 NaN
>>>
Notice I do NOT get an error for the sort_values() statement, but the CheckTime = 16:00 obs are still NOT sorted properly. (I ran the whole script again, so the 16:00 obs have replaced the 14:35 obs...)
Can anyone explain why this is happening?
EDIT 2
===========================================
algorythms below suggested a problem with the time values being strings. This occurred to me so I tried this experiment which does not seem to mind that the time values are strings:
>>> df55 = pd.DataFrame(columns=['ContestTime'])
>>> df55['ContestTime'] = ["09:33","01:45","20:01","06:24"]
>>> df55.sort_values(by=['ContestTime'], inplace=True)
>>> df55
ContestTime
1 01:45
3 06:24
0 09:33
2 20:01
>>>
Furthermore, algorythms suggested code does not seem to address the issue (unless I am coding it wrong...):
>>> df41 = pd.concat([df40,df25])
>>> df41['CheckTime'] = pd.to_datetime(df41['CheckTime']).dt.time
>>> df41.reset_index(drop=True, inplace=True)
>>> df41.sort_values(by=['id', 'CheckDate', 'CheckTime'], inplace=True)
>>> df41
CheckDate CheckDay CheckTime ContestDate ContestDay ContestTime s id nt m delta_nt
9 2023-07-15 Saturday 16:00:00 2023-07-15 Saturday 19:05:00 2 146731350 51 114 NaN
1 2023-07-15 Saturday 14:01:00 2023-07-15 Saturday 19:05:00 2 146731350 47 114 -9612.0
2 2023-07-15 Saturday 14:03:00 2023-07-15 Saturday 19:05:00 2 146731350 47 114 NaN
0 2023-07-15 Saturday 14:19:00 2023-07-15 Saturday 19:05:00 2 146731350 48 114 NaN
10 2023-07-15 Saturday 16:00:00 2023-07-15 Saturday 19:05:00 2 146731371 63 113 NaN
4 2023-07-15 Saturday 14:01:00 2023-07-15 Saturday 19:05:00 2 146731371 55 113 6402.0
5 2023-07-15 Saturday 14:03:00 2023-07-15 Saturday 19:05:00 2 146731371 55 113 -39.0
3 2023-07-15 Saturday 14:19:00 2023-07-15 Saturday 19:05:00 2 146731371 55 113 NaN
11 2023-07-15 Saturday 16:00:00 2023-07-15 Saturday 19:05:00 2 146731373 108 183 NaN
7 2023-07-15 Saturday 14:01:00 2023-07-15 Saturday 19:05:00 2 146731373 89 183 -24453.0
8 2023-07-15 Saturday 14:03:00 2023-07-15 Saturday 19:05:00 2 146731373 89 183 18146.0
6 2023-07-15 Saturday 14:19:00 2023-07-15 Saturday 19:05:00 2 146731373 91 183 NaN
>>>
EDIT 3
=========================================== Mohsin hasan asked for .dtypes info:
>>> df25.dtypes
CheckDate object
CheckDay object
CheckTime object
ContestDate object
ContestDay object
ContestTime object
s int64
id int64
nt int64
m int64
dtype: object
>>> df40.dtypes
CheckDate object
CheckDay object
CheckTime object
ContestDate object
ContestDay object
ContestTime object
s int64
id int64
nt int64
m int64
dtype: object
>>>
EDIT 4
=========================================== Results from Mohsin's test code:
>>> df1 = pd.DataFrame({'c1': ['16:00:00']})
>>> df2 = pd.DataFrame({'c1': ['14:02:00', '14:01:00']})
>>> df3 = pd.concat([df1, df2])
>>> df3.reset_index(drop=True, inplace=True)
>>> df3.sort_values('c1', inplace=True)
>>> print(df3)
c1
2 14:01:00
1 14:02:00
0 16:00:00
>>>
Additional str length info:
>>> df40['CheckDate'].astype(str).str.len()
0 10
1 10
2 10
3 10
4 10
5 10
6 10
7 10
8 10
Name: CheckDate, dtype: int64
>>> df25['CheckDate'].astype(str).str.len()
0 10
1 10
Name: CheckDate, dtype: int64
EDIT 5.1
===========================================
>>>
>>> import pandas as pd
>>> import os
>>>
>>> os.chdir("E:\\")
>>>
>>>
>>> df1 = pd.DataFrame({'c1': ['16:00:00']})
>>> df2 = pd.DataFrame({'c1': ['14:02:00', '14:01:00']})
>>>
>>> df2.to_csv("df2.csv", index=False)
>>>
>>> del df2
>>>
>>> df2 = pd.read_csv("df2.csv")
>>>
>>> df3 = pd.concat([df1, df2])
>>> print(df3)
c1
0 16:00:00
0 14:02:00
1 14:01:00
>>> df3.reset_index(drop=True, inplace=True)
>>> df3.sort_values('c1', inplace=True)
>>> print(df3)
c1
2 14:01:00
1 14:02:00
0 16:00:00
>>>
EDIT 5.2
===========================================
>>>
>>> df1 = pd.DataFrame({'c0': ['B','B','A'], 'c1': ['16:00:00','12:54:00','14:43:00']})
>>> df2 = pd.DataFrame({'c0': ['A','B','B'], 'c1': ['14:02:00', '14:01:00','15:16:00']})
>>> df1
c0 c1
0 B 16:00:00
1 B 12:54:00
2 A 14:43:00
>>> df2
c0 c1
0 A 14:02:00
1 B 14:01:00
2 B 15:16:00
>>>
>>> df2.to_csv("df2.csv", index=False)
>>>
>>> del df2
>>>
>>> df2 = pd.read_csv("df2.csv")
>>>
>>> df3 = pd.concat([df1, df2])
>>> print(df3)
c0 c1
0 B 16:00:00
1 B 12:54:00
2 A 14:43:00
0 A 14:02:00
1 B 14:01:00
2 B 15:16:00
>>>
>>> df3.reset_index(drop=True, inplace=True)
>>> df3.sort_values(by=['c0', 'c1'], inplace=True)
>>> print(df3)
c0 c1
3 A 14:02:00
2 A 14:43:00
1 B 12:54:00
4 B 14:01:00
5 B 15:16:00
0 B 16:00:00
>>>