0

I learned join methods in sql, and I know that inner join means returning only the intersections of the two different tables that we want to set.

I thought for python the concept is same. But I have problem understanding the certain code.

crsp1=pd.merge(crsp, crsp_maxme, how='inner', on=['jdate','permco','me'])

crsp1=crsp1.drop(['me'], axis=1)

crsp2=pd.merge(crsp1, crsp_summe, how='inner', on=['jdate','permco'])

If I understood correctly, the first line merges table crsp and crsp_maxme with intersection on column 'jdate', 'permco', 'me'. So the table crsp1 would have 3 columns. The second line drops the 'me' column of table crsp1. The last lien would merge newly adjusted table crsp1 and crsp_summe with inner join, with intersection on 'jdate' and 'permco'. Which makes newly merged table crsp2 only having 2 columns.

However, the code explanation from line 2 says that the second and third lines drop 'me' column from crsp1 and then replace it with 'me' from crsp_summe table, which I had problem understanding.

Could anyone clarify these lines for me?

PS: I thought it isn't necessary to explain what the table crsp, crsp_summe, and crsp_maxme since they are all framed by inner join function. So please excuse the lack of background info.

cycla
  • 147
  • 5
  • 1
    If you want to understand then study this using pure SQL. Any padding interferes with understanding. – Akina Feb 02 '21 at 13:04
  • "I know that inner join means returning only the intersections of the two different tables" -- not really. You can join using any predicate you want. – The Impaler Feb 02 '21 at 13:04
  • @TheImpaler Oh, I must have misunderstood the join concept about sql then:) So according to correct knowledge about the join methods, the explanation about replacing 'me' from 'crsp1' with 'me' from 'crsp_summe' is possible since inner join on 'jdate','permco' doesn't mean returning only 2 columns? – cycla Feb 02 '21 at 13:14
  • As @Akina says, if you want to understand SQL, use SQL. Pandas (for python) is not SQL – nacho Feb 02 '21 at 14:06
  • @cycla The confusion you have about joins is that you implicitly talk about "equi joins". The tool you are using in the example `on=['jdate','permco','me']` probably only implements this type of join, and that's OK; equi joins are the most common type of joins. Nevertheless, your tool is hiding a world of possibilities. – The Impaler Feb 02 '21 at 18:06

1 Answers1

1

The merge() functions on parameter specifies on what columns you want to make joins. how specifies what type of join you want to apply (similar to sql joins as outer, inner, left, right etc.).

Ex: suppose there are two tables A and B containing columns as A['x1','x2','x3'] and B['x2','y1'] so joining them based on 'x1' (as it is common column in both table) would produce A_join_B_on_x1['A_B_x1','A_x2','A_x3','B_y1'] and the join will based on how you want to join.

in your current code consider,

A = crsp1
B = crsp_maxme
C = crsp_summe

Now in your program your first line merges your A,B on ['jdate','permco','me'] columns and creates a new dataframe A_B containing ['jdate','permco','me',...'+columns_from_both_tables(A)(B)'] as inner join (i.e rows which are common in both A,B based on ['jdate','permco','me'] columns)

enter image description here

second line drops 'me' column from A_B dataframe. so it will be something like ['jdate','permco',...'+columns_from_both_tables(A)(B)']

enter image description here

third line merges your A_B,C on ['jdate','permco'] and creates ['jdate','permco',...'+columns_from_both_tables(A_B)(C)'] as inner join (i.e rows which are common in both A_B,C based on ['jdate','permco','me'] columns)

enter image description here

k33da_the_bug
  • 812
  • 8
  • 16