Python Pandas: Compare two CSV files and delete lines from both the file by matching a column

Question

We need to delete lines from both of the files if the value of the first column is not present in the other file.

Lets consider two CSV files:

file1.csv:
yrdi_391    111    1.11    1.0    1.1    111.0
yfyrn_9132  222    2.22    2.0    2.2    222.0
kdkfke_392  999    9.99    9.0    9.9    999.0
hfeisk_3    333    3.33    3.0    3.3    333.0

file2.csv:
yrdi_391    444    4.44    4.0    4.4    444.0
yfyrn_9132  555    5.55    5.0    5.5    555.0
hfeisk_3    666    6.66    6.0    6.6    666.0
fhedn_271   888    8.88    8.0    8.8    888.0

Now, we need to delete entire line starting with kdkfke_392 from the file1.csv as it's not present therein file2.csv.

On the other hand, we need to delete the entire line starting with fhedn_271, as it's not present in file1.csv.

Expected result:

file1.csv:
yrdi_391    111    1.11    1.0    1.1    111.0
yfyrn_9132  222    2.22    2.0    2.2    222.0
hfeisk_3    333    3.33    3.0    3.3    333.0

file2.csv:
yrdi_391    444    4.44    4.0    4.4    444.0
yfyrn_9132  555    5.55    5.0    5.5    555.0
hfeisk_3    666    6.66    6.0    6.6    666.0

As of now, the lines in file1.csv and file2.csv are not sorted. If required, we may do the sorting first and then apply the deletion.

Pandas CVS related manipulations are preferred as in both of the files we have headers and need to keep them.

Newbie in python scripting!

Any help will be highly appreciated!

@RahulAgarwal I do not want to join these files. Need to update them while keeping them separated. — RandomCoder, Nov 13 '18 at 14:52

Abhi · Accepted Answer · 2018-11-13T16:33:56.113

2

You can use isin().

print (df)

            0    1     2    3    4      5
0    yrdi_391  111  1.11  1.0  1.1  111.0
1  yfyrn_9132  222  2.22  2.0  2.2  222.0
2  kdkfke_392  999  9.99  9.0  9.9  999.0
3    hfeisk_3  333  3.33  3.0  3.3  333.0

print (df1)

            0    1     2    3    4      5
0    yrdi_391  444  4.44  4.0  4.4  444.0
1  yfyrn_9132  555  5.55  5.0  5.5  555.0
2    hfeisk_3  666  6.66  6.0  6.6  666.0
3   fhedn_271  888  8.88  8.0  8.8  888.0

csv_df = df[df[0].isin(df1[0])]

print (csv_df)
            0    1     2    3    4      5
0    yrdi_391  111  1.11  1.0  1.1  111.0
1  yfyrn_9132  222  2.22  2.0  2.2  222.0
3    hfeisk_3  333  3.33  3.0  3.3  333.0

csv_df1 = df1[df1[0].isin(df[0])]

print (csv_df1)
            0    1     2    3    4      5
0    yrdi_391  444  4.44  4.0  4.4  444.0
1  yfyrn_9132  555  5.55  5.0  5.5  555.0
2    hfeisk_3  666  6.66  6.0  6.6  666.0

csv_df.to_csv('temp.csv', index=False)
csv_df1.to_csv('temp1.csv', index=False)

edited Nov 13 '18 at 16:33

answered Nov 13 '18 at 15:26

Abhi

4,068
1
16
29

Giving a couple of exceptions! – RandomCoder Nov 13 '18 at 16:04
Traceback (most recent call last): File "/python3.6/site-packages/pandas/core/indexes/base.py", line 2525, in get_loc return self._engine.get_loc(key) File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 0 _and another one_ – RandomCoder Nov 13 '18 at 16:16
@RandomCoder You have to use your first column name which you want to check in place of `0` something like `df[df[your col name].isin(df1[your col name])]` – Abhi Nov 13 '18 at 16:27

Python Pandas: Compare two CSV files and delete lines from both the file by matching a column

1 Answers1