1

I apologize in advance if this has been covered, I could not find anything quite like this. This is my first programming job (I was previously software QA) and I've been beating my head against a wall on this.

I have 2 dataframes, one is very large [df2] (14.6 million lines) and I am iterating through it in chunks. I attempted to compare a column of the same name in each dataframe, if they're equal I would like to output a secondary column of the larger frame.

i.e.

if df1['tag'] == df2['tag']:
   df1['new column'] = df2['plate']

I attempted a merge but this didn't output what I expected.

df3 = pd.merge(df1, df2, on='tag', how='left')

I hope I did an okay job explaining this.

[Edit:] I also believe I should mention that df2 and df1 both have many additional columns I do not want to interact with/change. Is it possible to only compare the single columns of two dataframes, and output the third additional column?

smci
  • 32,567
  • 20
  • 113
  • 146
seclusion
  • 13
  • 3

3 Answers3

1

You may try inner merge. First, you may inner merge df1 with df2 and then you will get plates only for common rows and you can rename new df1's column as per your need

df1 = df1.merge(df2, on="tag", how = 'inner')

df1['new column'] = df1['plate']
del df1['plate']

I hope this works.

DJK
  • 8,924
  • 4
  • 24
  • 40
Hari_pb
  • 7,088
  • 3
  • 45
  • 53
  • @seclusion If you want to create new dataframe, that can also work as you need only limited columns. – Hari_pb Apr 17 '18 at 19:50
  • I tried this but received an error, I believe this is due to the first dataframe (df1) not having a column for "plate". I realized they would each need the same column (tag) in order to merge, does it need to have the plate column as well for the merge to be successful? – seclusion Apr 17 '18 at 20:04
  • No, you need only one common column `tag` to merge the two dataframes. Can you share a screenshot what you are trying ? – Hari_pb Apr 17 '18 at 20:06
  • For clarification 'ISO band' is 'tag'. Thank you immensely for your help: https://imgur.com/a/Trw6Y – seclusion Apr 17 '18 at 20:10
  • You are a step away, just need to add, df1 = df1....... on line 50, till you add changes to the dataframe df1, it will not be reflected. – Hari_pb Apr 17 '18 at 20:14
0

This is totally a case for join/merge. You want to put df2 on the left because it's smaller.

df2.join(df1, on='tag', ...)

You only misunderstood the type of join/merge) you want to make:

how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’

'how'='left' join would create an (unwanted) entry for all rows of the LHS df2. That's not quite what you want (if df2 contained other tag values not seen in df1, you'd also get entries for them).

'how'='inner' would form the intersection of df2 and df1 on the 'on'='tag' field. i.e. you only get entries for where df1 contains a valid tag value according to df2.

So:

df3 = df2.join(df1, on='tag', how='inner')
# then reference df3['plate']

or if you only want the 'plate' column in df3 (or some other selection of columns), you can directly do:

df2.join(df1, on='tag', how='inner') ['plate']
smci
  • 32,567
  • 20
  • 113
  • 146
0

As smci mentioned, this is a perfect time to use join/merge. If you're looking to preserve df1, a left join is what you want. So you were on the right path:

df1 = pd.merge(df1['tag'],
               df2['tag', 'plate'],
               on='tag', how='left')
df1.rename({'plate': 'new column'}, axis='columns')

That will only compare the tag columns in each dataframe, so the other columns won't matter. It'll bring over the plate column from df2, and then renames it to whatever you want your new column to be named.

RCA
  • 508
  • 4
  • 12