I have a DataFrame like the one below:
df
Index Lines
0 /// User states this is causing a problem and but the problem can only be fixed by the user. /// User states this is causing a problem and but the problem can only be fixed by the user.
1 //- How to fix the problem is stated below. Below are the list of solutions to the problem. //- How to fix the problem is stated below. Below are the list of solutions to the problem.
2 \\ User describes the problem in the problem report.
I want to remove the repeated sentences, but not the duplicated words within a sentence.
I tried the following, but it also removes duplicated words in the process:
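from collections import OrderedDict

# Splits on whitespace and keeps only the first occurrence of each word
df['cleaned'] = (df['lines'].str.split()
                 .apply(lambda x: OrderedDict.fromkeys(x).keys())
                 .str.join(' '))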
This results in
Index cleaned
0 /// User states this is causing a problem and but the can only be fixed by user.
1 //- How to fix the problem is stated below. Below are list of solutions problem.
2 \\ User describes the problem in report.
But the expected output is:
Index cleaned
0 /// User states this is causing a problem and but the problem can only be fixed by the user.
1 //- How to fix the problem is stated below. Below are the list of solutions to the problem.
2 \\ User describes the problem in the problem report.
How do I get it to remove the repeated sentences but keep the duplicated words?
Is there a way, with regex, to grab the first sentence (ending with a "."), check whether that sentence appears again later in the string, and if it does, remove everything from the point where it repeats to the end?
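To make that idea concrete, here is a rough sketch of the logic I have in mind. It uses plain string search (str.find) rather than a regex, the function name drop_repeated_tail is just something I made up, and it assumes the first "." in the string really marks the end of the first sentence, which would break on abbreviations or decimals.

```python
def drop_repeated_tail(text: str) -> str:
    """Cut the string at the point where its first sentence starts repeating.

    Hypothetical helper: assumes the first "." ends the first sentence.
    """
    end = text.find(".")
    if end == -1:
        # No sentence terminator at all, nothing to compare against
        return text
    first_sentence = text[: end + 1]
    # Search for a second occurrence, starting just past the first one
    repeat_at = text.find(first_sentence, end + 1)
    if repeat_at == -1:
        # The first sentence never repeats, keep the string unchanged
        return text
    # Keep everything before the repeat, dropping the separating whitespace
    return text[:repeat_at].rstrip()


samples = [
    "/// User states this is causing a problem and but the problem can only be fixed by the user. "
    "/// User states this is causing a problem and but the problem can only be fixed by the user.",
    "//- How to fix the problem is stated below. Below are the list of solutions to the problem. "
    "//- How to fix the problem is stated below. Below are the list of solutions to the problem.",
    r"\\ User describes the problem in the problem report.",
]
cleaned = [drop_repeated_tail(s) for s in samples]
```

On the DataFrame this would presumably be applied as df['cleaned'] = df['lines'].apply(drop_repeated_tail), but I am not sure whether this is the right approach or whether there is a cleaner vectorized way.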
Please advise. Thanks!