I cannot find an easy way to get all the rows of a data frame that are found in one dataframe but not a second dataframe if the data is unordered.
These two answers talk are solutions for ordered data:
Get rows that are present in one dataframe, but not the other
So just to make it clear I'm trying to get this:
In one of those related question mentioned above I found a multiindexing solution that supposedly works with unordered data, but I was unable to implement it. I am hoping theres an easier way.
let me give you an example of the data I'm working with:
DF1
col_a col_b
1325 foo
1397 foo #<---matching value, but not matching index in DF2
1645 foo
... ...
DF2
col_1 col_2
1397 foo #<---matching value, but not matching index in DF1
1500 foo
1621 foo
... ...
Now if that were all the data in both dataframes the result for processing this specifically for DF1 would look like this:
DF1_UNIQUE
col_a col_b
1325 foo
1645 foo
(So I'm really only caring about col_a
or for DF2 col_1
). Notice its missing the 1397 row. that's because it is found in DF2, so I don't want it returned to my new DF. But its not found in the same index and there in lies the problem I have. I already easily created a solution if all the matching indexes are lined up, but I don't know where to start on the indexes that aren't lined up. Can I use the merge function? Or is that the wrong tool for this job?
This code isn't entirely relevant but its the solution I came up with if all the indexes lined up correctly:
def getUniqueEntries(df1, df2):
"""takes two dataframes, returns a dataframe that is comprized of all the rows unique to the first dataframe."""
d1columns = df1.columns
d2columns = df2.columns
df3 = pd.merge(df1, df2, left_on=d1columns[0], right_on=d2columns[0])
print(df3)
return df1[(~df1[d1columns[0]].isin(df3[d1columns[0]]))]
def main(fileread1, fileread2, writeprefix):
df1 = pd.read_csv(fileread1)
df2 = pd.read_csv(fileread2)
df3 = getUniqueEntries(df1, df2)
df4 = getUniqueEntries(df2, df1)
print(df3)
print(df4)
df3.to_csv(writeprefix+fileread1, index=False)
df4.to_csv(writeprefix+fileread2, index=False)
if __name__ == '__main__':
main(sys.argv[1], sys.argv[2], sys.argv[3])