remove rows from one dataframe based on conditions from another dataframe in pandas Python

Question

I have two pandas DataFrames in Python, each containing millions of rows. I want to remove the rows from the first DataFrame that contain words from the second DataFrame, based on three conditions:

If the word appears at the beginning of the sentence in a row
If the word appears at the end of the sentence in a row
If the word appears in the middle of the sentence in a row (as an exact word, not as a substring)

Example:

First Dataframe:

This is the first sentence
Second this is another sentence
This is the third sentence forth
This is fifth sentence
This is fifth_sentence

Second Dataframe:

Second
forth
fifth

Output Expected:

This is the first sentence
This is fifth_sentence

Please note that both DataFrames contain millions of records. How can I process them and export the result in the most efficient way?

I tried the following, but it takes a very long time:

import pandas as pd
import re

bad_words_file_data = pd.read_csv("words.txt", sep = ",", header = None)
sentences_file_data = pd.read_csv("setences.txt", sep = ".", header = None)

bad_words_index = []
for i in sentences_file_data.index:
    print("Processing Sentence:- ", i, "\n")
    single_sentence = sentences_file_data[0][i]
    for j in bad_words_file_data.index:
        word = bad_words_file_data[0][j]
        if single_sentence.endswith(word) or single_sentence.startswith(word) or word in single_sentence.split(" "):
            bad_words_index.append(i)
            break
            
sentences_file_data = sentences_file_data.drop(index=bad_words_index)
sentences_file_data.to_csv("filtered.txt",header = None, index = False)

Thanks

sophocles · Accepted Answer · 2021-06-11T09:19:41.460

You can use the numpy.where function to create a column called 'remove' that is set to 1 when any of the conditions you outlined is satisfied. First, create a list with the values of df2.

Condition 1: checks whether the cell value starts with any of the values in your list

Condition 2: same as above, but checks whether the cell value ends with any of the values in your list

Condition 3: splits each cell on spaces and checks whether any of the resulting tokens is in your list

Then create your new dataframe by filtering out the rows marked 1:

# Imports
import pandas as pd
import numpy as np

# Get the values from df2 in a list
l = list(set(df2['col']))

# Set conditions
c = df['col']

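# axis must be passed by keyword in pandas >= 2.0; index=c.index keeps the
# split frame aligned with c even when df does not use the default index
cond = (c.str.startswith(tuple(l))
        | c.str.endswith(tuple(l))
        | pd.DataFrame(c.str.split(' ').tolist(), index=c.index).isin(l).any(axis=1))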

# Assign 1 or 0
df['remove'] = np.where(cond,1,0)

# Create the filtered dataframe and drop the helper column
out = (df[df['remove']!=1]).drop(['remove'],axis=1)

out prints:

                          col
0  This is the first sentence
4      This is fifth_sentence

References:

Pandas Row Select Where String Starts With Any Item In List

check if a columns contains any str from list

Dataframes used:

>>> df.to_dict()

{'col': {0: 'This is the first sentence',
  1: 'Second this is another sentence',
  2: 'This is the third sentence forth',
  3: 'This is fifth sentence',
  4: 'This is fifth_sentence'}}

>>> df2.to_dict()

{'col': {0: 'Second', 1: 'forth', 2: 'fifth'}}
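
To reproduce the example, the two dictionaries shown above can be passed back to the DataFrame constructor. This is only a minimal sketch; the names df and df2 follow the answer's code.

```python
import pandas as pd

# Rebuild the sentence frame from the to_dict() output shown above.
df = pd.DataFrame({'col': {0: 'This is the first sentence',
                           1: 'Second this is another sentence',
                           2: 'This is the third sentence forth',
                           3: 'This is fifth sentence',
                           4: 'This is fifth_sentence'}})

# Rebuild the word frame the same way.
df2 = pd.DataFrame({'col': {0: 'Second', 1: 'forth', 2: 'fifth'}})

print(df)
print(df2)
```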

Thanks, the solution is correct. But my dataset has millions of rows, and it fails while running it on a 32 GB and 16 core machine. Can you please provide a more efficient code? — Tanmay Jain, Jun 11 '21 at 09:33
Could you *time* each line and check which one is the one that needs to be changed? — sophocles, Jun 11 '21 at 11:30
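
On the memory problem raised in the comments: one plausible culprit is the third condition, which expands every sentence into a wide intermediate DataFrame, and the tuple-based startswith/endswith checks compare each sentence against every word. A minimal sketch of an alternative, which is not the answerer's method, tests each sentence's tokens against a Python set. It assumes words are separated by whitespace and that matching is case-sensitive and exact, so "Secondly" or "forth." would not count as matches (unlike startswith/endswith). The helper name is_clean is hypothetical.

```python
import pandas as pd

# Sample data matching the example in the question.
df = pd.DataFrame({"col": [
    "This is the first sentence",
    "Second this is another sentence",
    "This is the third sentence forth",
    "This is fifth sentence",
    "This is fifth_sentence",
]})
df2 = pd.DataFrame({"col": ["Second", "forth", "fifth"]})

# A set gives constant-time (on average) membership tests,
# regardless of how many words it holds.
bad_words = set(df2["col"])


def is_clean(sentence: str) -> bool:
    """Return True when no whitespace-separated token is a bad word."""
    return bad_words.isdisjoint(sentence.split())


# Single pass over the sentences; no wide intermediate frame is built.
# A word at the start, end, or middle of a sentence is simply one of its
# tokens, so the three conditions reduce to one membership check.
keep = [is_clean(s) for s in df["col"]]
out = df[keep]

print(out)
```

If the sentence file is too large to hold in memory, the same check could be applied chunk by chunk using the chunksize argument of pd.read_csv, appending each filtered chunk to the output file.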
