Using Pandas DataFrames, I am trying to implement self join on a data table with shop ids, product ids, and levels. The Shop and Product ID determine a level. Unknown levels are indicated by null. The problem is as follows: For Shop ID A, find the level of the product in other shops. For products not sold by other shops, report a null value. For products which are not sold by Shop A, report a null value.
I have provided the complete pandas code below as well as the expected solution.
from io import StringIO
import pandas as pd
SOURCE_DATA = StringIO("""
shop_id,product_id,level
A,p,1
A,q,2
A,r,3
B,p,2
B,q,1
B,s,3
C,p,3
C,q,1
C,t,2
D,p,3
D,q,3
D,r,3
E,s,1
E,t,2
E,u,3
""")
EXPECTED = StringIO("""
target_shop_id,shop_id,product_id,target_level,level
A,B,p,1,2
A,B,q,2,1
A,B,r,3,NULL
A,B,s,NULL,3
A,C,p,1,3
A,C,q,2,1
A,C,r,3,NULL
A,C,t,NULL,2
A,D,p,1,3
A,D,q,2,3
A,D,r,3,3
A,E,p,1,NULL
A,E,q,2,NULL
A,E,r,3,NULL
A,E,s,NULL,1
A,E,t,NULL,2
A,E,u,NULL,3
""")
df_source = pd.read_csv(SOURCE_DATA, sep=",")
df_expected = pd.read_csv(EXPECTED, sep=",")
print(df_source)
print(df_expected)
id = 'A'
df_target = df_source[(df_source['shop_id'] == id)]
df_non_target = df_source[(df_source['shop_id'] != id)]
df_result = df_target.merge(df_non_target, how="outer", on="product_id")
print(df_result)
Expected Result:
target_shop_id shop_id product_id target_level level
0 A B p 1.0 2.0
1 A B q 2.0 1.0
2 A B r 3.0 NaN
3 A B s NaN 3.0
4 A C p 1.0 3.0
5 A C q 2.0 1.0
6 A C r 3.0 NaN
7 A C t NaN 2.0
8 A D p 1.0 3.0
9 A D q 2.0 3.0
10 A D r 3.0 3.0
11 A E p 1.0 NaN
12 A E q 2.0 NaN
13 A E r 3.0 NaN
14 A E s NaN 1.0
15 A E t NaN 2.0
16 A E u NaN 3.0
My Result:
shop_id_x product_id level_x shop_id_y level_y
0 A p 1.0 B 2
1 A p 1.0 C 3
2 A p 1.0 D 3
3 A q 2.0 B 1
4 A q 2.0 C 1
5 A q 2.0 D 3
6 A r 3.0 D 3
7 NaN s NaN B 3
8 NaN s NaN E 1
9 NaN t NaN C 2
10 NaN t NaN E 2
11 NaN u NaN E 3
My result is missing 5 rows where the 'level_y' is NaN.
Any suggestions as to how to fix my code?