how to delete rows from a csv file which string in 1st column is the same of string in 1st column of another csv?

Question

I have 2 csv files. I need to delete all rows of the first file in which the first column have a string found in the first column of the second file. The head of table 1 is:

Genus	FAGR	MOCA	MUBR	MUHA
1-14-0-20-45-16	0	0	40	0
1-14-0-20-46-22	0	0	0	169
2-02-FULL-61-13	0	0	0	27
2-12-FULL-35-15	56	182	435	311

The head of table 2 is:

Genus	FAGR	MOCA	MUBR
1-14-0-20-46-22	0	0	0
2-02-FULL-61-13	0	0	0
21-14-0-10-47-8-A	0	0	0
AAA536-G1	0	0	0

The expected output file contains the rows of file 1 except the rows that match the first 2 rows of the second file (which have in common the following strings in the first column: 1-14-0-20-46-22 and 2-02-FULL-61-13). When the complete files are compared, the entire file 2 has to be removed from file 1.

I am going through pandas indexing and selecting data but still cannot find a solution, probably becase I am a newbie.

I tried the solution posted and it came out like this:

df1 = generagrouped_df
df2['drop_key'] = 'DROP'
output = pd.merge(
left = df1,
right = df2,
how = 'left',
left_on = ['Genus'],
right_on = ['Genus']
)
output.drop(output[output['drop_key'] == 'DROP'].index, inplace = True)

The error message was KeyError: 'drop_key' (below):

KeyError                                  Traceback (most recent call last)
<ipython-input-103-67d27afa824b> in <module>()
----> 1 output.drop(output[output['drop_key'] == 'DROP'].index, inplace = True)

/Users/AnaPaula/opt/anaconda2/lib/python2.7/site-packages.   /pandas/core/frame.pyc in __getitem__(self, key)


2925             if self.columns.nlevels > 1:
2926                 return self._getitem_multilevel(key)
-> 2927             indexer = self.columns.get_loc(key)
2928             if is_integer(indexer):    
2929                 indexer = [indexer]
/Users/AnaPaula/opt/anaconda2/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in get_loc(self, key, method, tolerance)
2657                 return self._engine.get_loc(key)
2658             except KeyError:   
-> 2659                 return self._engine.get_loc(self._maybe_cast_indexer(key))
2660         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
 2661         if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
 KeyError: 'drop_key'

Can you figure out the solution? Thanks AP

Please post the CSV files as text, not images. We aren't going to transpose them ourselves. And include a working script with your first attempt. You could skip the CSV and just initialize dataframes with the desired data. — tdelaney, Aug 25 '21 at 22:02
I don't use pandas, but you can do what you need with the csv module. 1) read the second file and store the 1st column of each row in a list. 2) read the first file and store the row in a list if row[0] is not in the previous list. — PandaBlue, Aug 25 '21 at 22:12

score 0 · Answer 1 · answered Aug 25 '21 at 22:07

Try adding a new column to the csv file where the drop keys are located, then dropping on that condition by index:

import pandas as pd

file1 = pd.read_csv('file_1.csv')
file2 = pd.read_csv('file_2.csv')

# Assign the keyword drop to the file with the strings you're looking
# to drop from your final solution.
file2['drop_key'] = 'DROP'

# Merge the files together
output = pd.merge(
    left = file1,
    right = file2,
    how = 'left',
    left_on = ['str_col'],
    right_on = ['str_col']
)

# Drop the rows that have the keyword 'DROP'
output.drop(output[output['drop_key'] == 'DROP'].index, inplace = True)

Please note that left_on and right_on should be the name of the columns containing the strings you're matching on. These were not available in the screenshots you provided, so I'm assuming a name of str_col.

Hi @rossdrucker9, thaks. I tried but got KeyError: 'drop_key'. I edited the post with the code and error message, and included just the head of the tables. Initially the advice on formatting tables did not work. For this edit I searched at Meta to find out how to post tables. It was not straightforward... — Nana_marinbio, Aug 27 '21 at 14:32

score 0 · Accepted Answer · answered Aug 27 '21 at 15:13

I found a solution. Since the entire file 2 had to be removed from file 1, I did the following command, which informed just the first column to be compared, and it worked:

df1.loc[pd.merge(df1, df2, on=['Genus'], how='left', indicator=True)['_merge'] == 'left_only']

Thanks for you time! AP

how to delete rows from a csv file which string in 1st column is the same of string in 1st column of another csv?

2 Answers2