6

I have two DataFrames with the same column names; some rows match and some are unique to each.

I want to exclude the overlap and keep only what is unique to each DataFrame.

How would I concat or merge or join these two dataframes to do so?

For instance, I do not want the middle (the rows both DataFrames share), only the two sides.


Here's my code right now:

def query_to_df(query):
    ...
    df_a = pd.DataFrame(data_a)
    df_b = pd.DataFrame(data_b)
    # axis=1 just puts the frames side by side; it does not drop the shared rows
    outer_results = pd.concat([df_a, df_b], axis=1, join='outer')
    return outer_results

Let me give you an example of what I need:

df_a = 
col_a  col_b  col_c
   a1     b1     c1
   a2     b2     c2

df_b = 
col_a  col_b  col_c
   a2     b2     c2
   a3     b3     c3

# they only share the 2nd row:    a2     b2     c2 
# so the outer result should be:
col_a  col_b  col_c  col_a  col_b  col_c
   a1     b1     c1     NA     NA     NA
   NA     NA     NA     a3     b3     c3

Or I'd be just as happy with two separate dataframes:

result_1 =
col_a  col_b  col_c
   a1     b1     c1

result_2 =
col_a  col_b  col_c
   a3     b3     c3

Lastly, you'll notice that a2 b2 c2 was excluded because all of the columns match. How do I specify that I want to join based on all the columns, not just one? If df_a had instead contained a2 foo c2, I would have wanted that row in result_1 as well.
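
For reference, here is a minimal construction of the example frames above (my sketch, not part of the original post). Note that `pd.merge` joins on all shared column names by default when `on` is not given, which is exactly the "join on every column" behaviour asked about:

import pandas as pd

df_a = pd.DataFrame({'col_a': ['a1', 'a2'], 'col_b': ['b1', 'b2'], 'col_c': ['c1', 'c2']})
df_b = pd.DataFrame({'col_a': ['a2', 'a3'], 'col_b': ['b2', 'b3'], 'col_c': ['c2', 'c3']})

# with no `on` argument, merge uses every column name the two frames share,
# so a row only counts as a match when col_a, col_b and col_c all agree
merged = df_a.merge(df_b, how='outer', indicator=True)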

MetaStack
    I think you are looking for `pd.merge` not `pd.concat`. Actually, not quite sure how you're merging... I think a more accurate description is that you are getting rows that are only present within one table. Even then, I'm not sure as to why you have them side by side, since now you have 2 `col_a` and `col_b` and `col_c` – AsheKetchum Nov 07 '17 at 20:30

3 Answers

6

Use pd.DataFrame.drop_duplicates with keep=False.
This assumes the rows were unique within their respective dataframes.

# note: DataFrame.append was removed in pandas 2.0; use pd.concat (below) on newer versions
df_a.append(df_b).drop_duplicates(keep=False)

  col_a col_b col_c
0    a1    b1    c1
1    a3    b3    c3
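
If rows might repeat within a single frame, one way to relax that assumption (a sketch of mine, not part of the original answer) is to de-duplicate each frame first, so that keep=False only removes rows that appear in both frames:

# drop in-frame duplicates first, then keep=False removes only the cross-frame matches
pd.concat([df_a.drop_duplicates(), df_b.drop_duplicates()]).drop_duplicates(keep=False)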

You can even use pd.concat with the keys parameter to record which dataframe each row came from.

pd.concat([df_a, df_b], keys=['a', 'b']).drop_duplicates(keep=False)

    col_a col_b col_c
a 0    a1    b1    c1
b 1    a3    b3    c3
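
And if you'd rather end up with the two separate dataframes asked for in the question, the keys give you a handle to split the result on the outer index level (my addition, assuming each side has at least one unique row):

res = pd.concat([df_a, df_b], keys=['a', 'b']).drop_duplicates(keep=False)
result_1 = res.loc['a']  # rows only in df_a
result_2 = res.loc['b']  # rows only in df_b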
piRSquared
5

Use merge with the indicator parameter and an outer join first, then filter by query or boolean indexing:

df = df_a.merge(df_b, how='outer', indicator=True)
print (df)
  col_a col_b col_c      _merge
0    a1    b1    c1   left_only
1    a2    b2    c2        both
2    a3    b3    c3  right_only

a = df.query('_merge == "left_only"').drop('_merge', axis=1)
print (a)
  col_a col_b col_c
0    a1    b1    c1

b = df.query('_merge == "right_only"').drop('_merge', axis=1)
print (b)
  col_a col_b col_c
2    a3    b3    c3

Or:

a = df[df['_merge'] == "left_only"].drop('_merge', axis=1)
print (a)
  col_a col_b col_c
0    a1    b1    c1

b = df[df['_merge'] == "right_only"].drop('_merge', axis=1)
print (b)
  col_a col_b col_c
2    a3    b3    c3
jezrael
1

concat and drop_duplicates with keep=False

new_df = pd.concat([df_a, df_b]).drop_duplicates(keep=False)

    col_a   col_b   col_c
0   a1      b1      c1
1   a3      b3      c3

Using numpy setdiff1d

# setdiff1d compares the flattened values (not whole rows); capture both value
# arrays first so the second call still compares against the *original* df_a
vals_a, vals_b = df_a.values, df_b.values

df_a = pd.DataFrame(np.setdiff1d(vals_a, vals_b).reshape(-1, df_a.shape[1]), columns=df_a.columns)
df_b = pd.DataFrame(np.setdiff1d(vals_b, vals_a).reshape(-1, df_b.shape[1]), columns=df_b.columns)

df_a

    col_a   col_b   col_c
0   a1      b1      c1

df_b

    col_a   col_b   col_c
0   a3      b3      c3
Vaishali