0

I am concatenating 2 pandas dataframes in Python 3.11.1 and then wanting to sort the result, however, I am getting unexpected results.

Here are the data:

df40
    CheckDate  CheckDay CheckTime ContestDate ContestDay ContestTime         id
0  2023-07-15  Saturday     14:19  2023-07-15   Saturday    19:05:00  146731350
1  2023-07-15  Saturday     14:01  2023-07-15   Saturday    19:05:00  146731350
2  2023-07-15  Saturday     14:03  2023-07-15   Saturday    19:05:00  146731350
3  2023-07-15  Saturday     14:19  2023-07-15   Saturday    19:05:00  146731371
4  2023-07-15  Saturday     14:01  2023-07-15   Saturday    19:05:00  146731371
5  2023-07-15  Saturday     14:03  2023-07-15   Saturday    19:05:00  146731371
6  2023-07-15  Saturday     14:19  2023-07-15   Saturday    19:05:00  146731373
7  2023-07-15  Saturday     14:01  2023-07-15   Saturday    19:05:00  146731373
8  2023-07-15  Saturday     14:03  2023-07-15   Saturday    19:05:00  146731373

df25
    CheckDate  CheckDay CheckTime ContestDate ContestDay ContestTime         id
0  2023-07-15  Saturday     14:35  2023-07-15   Saturday    19:05:00  146731350
1  2023-07-15  Saturday     14:35  2023-07-15   Saturday    19:05:00  146731371
2  2023-07-15  Saturday     14:35  2023-07-15   Saturday    19:05:00  146731373

Here is my code:

df40 = pd.read_csv(fn02)
df41 = pd.concat([df40,df25])
df41.reset_index(drop=True, inplace=True)
df41.sort_values(by=['id', 'CheckDate', 'CheckTime'], inplace=True)

And this is what I get:

df41
     CheckDate  CheckDay CheckTime ContestDate ContestDay ContestTime         id
0   2023-07-15  Saturday     14:35  2023-07-15   Saturday    19:05:00  146731350
1   2023-07-15  Saturday     14:01  2023-07-15   Saturday    19:05:00  146731350
2   2023-07-15  Saturday     14:03  2023-07-15   Saturday    19:05:00  146731350
3   2023-07-15  Saturday     14:19  2023-07-15   Saturday    19:05:00  146731350
4   2023-07-15  Saturday     14:35  2023-07-15   Saturday    19:05:00  146731371
5   2023-07-15  Saturday     14:01  2023-07-15   Saturday    19:05:00  146731371
6   2023-07-15  Saturday     14:03  2023-07-15   Saturday    19:05:00  146731371
7   2023-07-15  Saturday     14:19  2023-07-15   Saturday    19:05:00  146731371
8   2023-07-15  Saturday     14:35  2023-07-15   Saturday    19:05:00  146731373
9   2023-07-15  Saturday     14:01  2023-07-15   Saturday    19:05:00  146731373
10  2023-07-15  Saturday     14:03  2023-07-15   Saturday    19:05:00  146731373
11  2023-07-15  Saturday     14:19  2023-07-15   Saturday    19:05:00  146731373

It appears that the df40 data is properly sorted by id and time (if you look closely, you will notice that the CheckTime = 14:19 obs are not properly sorted in df40 to begin with (this is an iterative process, so the CheckTime =14:19 obs being out of sort is a result of the same issue from the previous iteration...)), however the df25 data is only sorted by id - all of the CheckTime = 14:35 obs are not properly sorted (the CheckTime = 14:35 obs appear at the top instead of the bottom of each id group).

Any assistance is appreciated.

EDIT 1:

===========================================

Here is my console output AFTER running the code supplied above.

>>> df41 = pd.concat([df40,df25])
>>> df41.reset_index(drop=True, inplace=True)
>>> df41.sort_values(by=['id', 'CheckDate', 'CheckTime'], inplace=True)
>>> df41
     CheckDate  CheckDay CheckTime ContestDate ContestDay ContestTime  s         id   nt    m  delta_nt
9   2023-07-15  Saturday     16:00  2023-07-15   Saturday    19:05:00  2  146731350   51  114       NaN
1   2023-07-15  Saturday     14:01  2023-07-15   Saturday    19:05:00  2  146731350   47  114   -9612.0
2   2023-07-15  Saturday     14:03  2023-07-15   Saturday    19:05:00  2  146731350   47  114       NaN
0   2023-07-15  Saturday     14:19  2023-07-15   Saturday    19:05:00  2  146731350   48  114       NaN
10  2023-07-15  Saturday     16:00  2023-07-15   Saturday    19:05:00  2  146731371   63  113       NaN
4   2023-07-15  Saturday     14:01  2023-07-15   Saturday    19:05:00  2  146731371   55  113    6402.0
5   2023-07-15  Saturday     14:03  2023-07-15   Saturday    19:05:00  2  146731371   55  113     -39.0
3   2023-07-15  Saturday     14:19  2023-07-15   Saturday    19:05:00  2  146731371   55  113       NaN
11  2023-07-15  Saturday     16:00  2023-07-15   Saturday    19:05:00  2  146731373  108  183       NaN
7   2023-07-15  Saturday     14:01  2023-07-15   Saturday    19:05:00  2  146731373   89  183  -24453.0
8   2023-07-15  Saturday     14:03  2023-07-15   Saturday    19:05:00  2  146731373   89  183   18146.0
6   2023-07-15  Saturday     14:19  2023-07-15   Saturday    19:05:00  2  146731373   91  183       NaN
>>> df41.sort_values(by=['id', 'CheckDate', 'CheckTime'], inplace=True)
>>> df41
     CheckDate  CheckDay CheckTime ContestDate ContestDay ContestTime  s         id   nt    m  delta_nt
9   2023-07-15  Saturday     16:00  2023-07-15   Saturday    19:05:00  2  146731350   51  114       NaN
1   2023-07-15  Saturday     14:01  2023-07-15   Saturday    19:05:00  2  146731350   47  114   -9612.0
2   2023-07-15  Saturday     14:03  2023-07-15   Saturday    19:05:00  2  146731350   47  114       NaN
0   2023-07-15  Saturday     14:19  2023-07-15   Saturday    19:05:00  2  146731350   48  114       NaN
10  2023-07-15  Saturday     16:00  2023-07-15   Saturday    19:05:00  2  146731371   63  113       NaN
4   2023-07-15  Saturday     14:01  2023-07-15   Saturday    19:05:00  2  146731371   55  113    6402.0
5   2023-07-15  Saturday     14:03  2023-07-15   Saturday    19:05:00  2  146731371   55  113     -39.0
3   2023-07-15  Saturday     14:19  2023-07-15   Saturday    19:05:00  2  146731371   55  113       NaN
11  2023-07-15  Saturday     16:00  2023-07-15   Saturday    19:05:00  2  146731373  108  183       NaN
7   2023-07-15  Saturday     14:01  2023-07-15   Saturday    19:05:00  2  146731373   89  183  -24453.0
8   2023-07-15  Saturday     14:03  2023-07-15   Saturday    19:05:00  2  146731373   89  183   18146.0
6   2023-07-15  Saturday     14:19  2023-07-15   Saturday    19:05:00  2  146731373   91  183       NaN
>>>

Notice I do NOT get an error for the sort_values() statement, but the CheckTime = 16:00 obs are still NOT sorted properly. (I ran the whole script again, so the 16:00 obs have replaced the 14:35 obs...)

Can anyone explain why this is happening?

EDIT 2

===========================================

algorythms below suggested a problem with the time values being strings. This occurred to me so I tried this experiment which does not seem to mind that the time values are strings:

>>> df55 = pd.DataFrame(columns=['ContestTime'])
>>> df55['ContestTime'] = ["09:33","01:45","20:01","06:24"]
>>> df55.sort_values(by=['ContestTime'], inplace=True)
>>> df55
  ContestTime
1       01:45
3       06:24
0       09:33
2       20:01
>>>

Furthermore, algorythms suggested code does not seem to address the issue (unless I am coding it wrong...):

>>> df41 = pd.concat([df40,df25])
>>> df41['CheckTime'] = pd.to_datetime(df41['CheckTime']).dt.time
>>> df41.reset_index(drop=True, inplace=True)
>>> df41.sort_values(by=['id', 'CheckDate', 'CheckTime'], inplace=True)
>>> df41
     CheckDate  CheckDay CheckTime ContestDate ContestDay ContestTime  s         id   nt    m  delta_nt
9   2023-07-15  Saturday  16:00:00  2023-07-15   Saturday    19:05:00  2  146731350   51  114       NaN
1   2023-07-15  Saturday  14:01:00  2023-07-15   Saturday    19:05:00  2  146731350   47  114   -9612.0
2   2023-07-15  Saturday  14:03:00  2023-07-15   Saturday    19:05:00  2  146731350   47  114       NaN
0   2023-07-15  Saturday  14:19:00  2023-07-15   Saturday    19:05:00  2  146731350   48  114       NaN
10  2023-07-15  Saturday  16:00:00  2023-07-15   Saturday    19:05:00  2  146731371   63  113       NaN
4   2023-07-15  Saturday  14:01:00  2023-07-15   Saturday    19:05:00  2  146731371   55  113    6402.0
5   2023-07-15  Saturday  14:03:00  2023-07-15   Saturday    19:05:00  2  146731371   55  113     -39.0
3   2023-07-15  Saturday  14:19:00  2023-07-15   Saturday    19:05:00  2  146731371   55  113       NaN
11  2023-07-15  Saturday  16:00:00  2023-07-15   Saturday    19:05:00  2  146731373  108  183       NaN
7   2023-07-15  Saturday  14:01:00  2023-07-15   Saturday    19:05:00  2  146731373   89  183  -24453.0
8   2023-07-15  Saturday  14:03:00  2023-07-15   Saturday    19:05:00  2  146731373   89  183   18146.0
6   2023-07-15  Saturday  14:19:00  2023-07-15   Saturday    19:05:00  2  146731373   91  183       NaN
>>>

EDIT 3

=========================================== Mohsin hasan asked for .dtypes info:

>>> df25.dtypes
CheckDate      object
CheckDay       object
CheckTime      object
ContestDate    object
ContestDay     object
ContestTime    object
s               int64
id              int64
nt              int64
m               int64
dtype: object
>>> df40.dtypes
CheckDate      object
CheckDay       object
CheckTime      object
ContestDate    object
ContestDay     object
ContestTime    object
s               int64
id              int64
nt              int64
m               int64
dtype: object
>>>

EDIT 4

=========================================== Results from Mohsin's test code:

>>> df1 = pd.DataFrame({'c1': ['16:00:00']})
>>> df2 = pd.DataFrame({'c1': ['14:02:00', '14:01:00']})
>>> df3 = pd.concat([df1, df2])
>>> df3.reset_index(drop=True, inplace=True)
>>> df3.sort_values('c1', inplace=True)
>>> print(df3)
         c1
2  14:01:00
1  14:02:00
0  16:00:00
>>>

Additional str length info:

>>> df40['CheckDate'].astype(str).str.len()
0    10
1    10
2    10
3    10
4    10
5    10
6    10
7    10
8    10
Name: CheckDate, dtype: int64
>>> df25['CheckDate'].astype(str).str.len()
0    10
1    10
Name: CheckDate, dtype: int64

EDIT 5.1

===========================================

>>>
>>> import pandas as pd
>>> import os
>>>
>>> os.chdir("E:\\")
>>>
>>>
>>> df1 = pd.DataFrame({'c1': ['16:00:00']})
>>> df2 = pd.DataFrame({'c1': ['14:02:00', '14:01:00']})
>>>
>>> df2.to_csv("df2.csv", index=False)
>>>
>>> del df2
>>>
>>> df2 = pd.read_csv("df2.csv")
>>>
>>> df3 = pd.concat([df1, df2])
>>> print(df3)
         c1
0  16:00:00
0  14:02:00
1  14:01:00
>>> df3.reset_index(drop=True, inplace=True)
>>> df3.sort_values('c1', inplace=True)
>>> print(df3)
         c1
2  14:01:00
1  14:02:00
0  16:00:00
>>>

EDIT 5.2

===========================================

>>>
>>> df1 = pd.DataFrame({'c0': ['B','B','A'], 'c1': ['16:00:00','12:54:00','14:43:00']})
>>> df2 = pd.DataFrame({'c0': ['A','B','B'], 'c1': ['14:02:00', '14:01:00','15:16:00']})
>>> df1
  c0        c1
0  B  16:00:00
1  B  12:54:00
2  A  14:43:00
>>> df2
  c0        c1
0  A  14:02:00
1  B  14:01:00
2  B  15:16:00
>>>
>>> df2.to_csv("df2.csv", index=False)
>>>
>>> del df2
>>>
>>> df2 = pd.read_csv("df2.csv")
>>>
>>> df3 = pd.concat([df1, df2])
>>> print(df3)
  c0        c1
0  B  16:00:00
1  B  12:54:00
2  A  14:43:00
0  A  14:02:00
1  B  14:01:00
2  B  15:16:00
>>>
>>> df3.reset_index(drop=True, inplace=True)
>>> df3.sort_values(by=['c0', 'c1'], inplace=True)
>>> print(df3)
  c0        c1
3  A  14:02:00
2  A  14:43:00
1  B  12:54:00
4  B  14:01:00
5  B  15:16:00
0  B  16:00:00
>>>
Dan
  • 165
  • 5
  • 18

2 Answers2

0

It looks like your 'CheckTime' is being seen as text instead of a time. So when you try to sort it, it's being sorted like words in a dictionary, not times on a clock. That's why 16:00 is coming before 14:01 - 1 is before 4 in the dictionary. Try this:

df41['CheckTime'] = pd.to_datetime(df41['CheckTime']).dt.time

You should then be able to sort by CheckTime

algorythms
  • 1,547
  • 1
  • 15
  • 28
  • Thanks for the Answer. Please see Edit 2 above. Maybe I am not coding your suggestion correctly...? – Dan Jul 15 '23 at 20:26
  • When sorting HH:MM timestamps, alphabetically is the same as chronologically – juanpethes Jul 15 '23 at 21:02
  • @juanpethes - So you do not think this is a variable typing issue? Any ideas what is causing these results? – Dan Jul 15 '23 at 21:05
  • I don't know, it seems like an odd issue... You are correct that the best practise is always to store it as the correct type, I just wanted to point out that when time is "sorted" as string decreasingly like YYYY-MM-DD HH:MM:SS then alphabetically = chronologically – juanpethes Jul 15 '23 at 21:09
  • @juanpethes - Ok. Thanks for the clarification. That jives with my HH:MM as text sorting experimenting above under Edit 2. – Dan Jul 15 '23 at 21:12
  • @dan Could you please add output of `df41.dtypes`? It could help us figure out what is happening. – Mohsin hasan Jul 16 '23 at 04:36
  • @Mohsin - Thanks for looking at this. See Edit 3 above. – Dan Jul 16 '23 at 13:41
  • Ok, in that case. Could you also print string lengths just to rule out any characters that are not visible in html display? Basically, check df['CheckDate'].astype(str).str.len() and df['CheckTime'].astype(str).str.len() – Mohsin hasan Jul 16 '23 at 15:57
  • @Mohsin - See Edit 4 above. Yes, this works as expected. My working hypothesis right now is that it has something to do with df40 being read in from csv using pd.read_csv(). I am going to update your test code to test ASAP. – Dan Jul 16 '23 at 16:00
  • @Mohsin - Please see Edit 5.1 and 5.2. In 5.1, I printed df2 to csv, deleted it, and read it in with pd.read_csv(). It worked as expected - I was not able to reproduce the issue. In 5.2, I added a 2nd sort by variable, but it also worked as expected (not able to reproduce the issue). – Dan Jul 16 '23 at 18:21
0

I finally figured this out: Apparently, the date/time values from one data source were stored as a string (incorrectly) and the date/time values for another were stored as datetime library values (correctly). Then, when the two dataframes were concatenated, Pandas stored the values stored as different types in the same column. I was not aware that Pandas would allow values of different types stored in the same column (I am still not 100% sure of that (it seems like a VERY WEIRD thing), but apparently that is what was happening here).

Eventually, I got the error message below and this post seems to address the issue:

'<' not supported between instances of 'datetime.date' and 'str'

Dan
  • 165
  • 5
  • 18