
I have a DataFrame as below

df

Index   Lines

0  /// User states this is causing a problem and but the problem can only be fixed by the user. /// User states this is causing a problem and but the problem can only be fixed by the user.
1  //- How to fix the problem is stated below. Below are the list of solutions to the problem. //- How to fix the problem is stated below. Below are the list of solutions to the problem.
2 \\ User describes the problem in the problem report.

I want to remove repeated sentences but not the duplicated words.

I tried the following solution but it also removes duplicated words in the process.

from collections import OrderedDict

df['cleaned'] = (df['lines'].str.split()
                            .apply(lambda x: OrderedDict.fromkeys(x).keys())
                            .str.join(' '))

This results in

Index   cleaned

0  /// User states this is causing a problem and but the can only be fixed by user.
1  //- How to fix the problem is stated below. Below are list of solutions problem.
2 \ User describes the problem in report.

But the expected result is:

Index   cleaned

0  /// User states this is causing a problem and but the problem can only be fixed by the user.
1  //- How to fix the problem is stated below. Below are the list of solutions to the problem.
2 \\ User describes the problem in the problem report.

How do I get it to remove the repeated sentences but not the duplicate words? Is there a way to get this done?

Is there a way in regex to grab the first sentence ending with a ".", check whether that first sentence appears again in the big string, and remove everything from where it repeats to the end?

Please advise or suggest. Thanks!!

code_learner
  • If I understood well, you have your dataframe which contains sentences for each element right? Something like this: df = { 0: "First sentence", 1: "Second sentence", ...}? Then, if a sentence appears more than once in that big string, remove the duplicates. Is it correct to think like this? – NickS1 Sep 30 '21 at 21:38
  • @NickS1 Almost correct, except I need only the repeated strings to be removed not the duplicated words within the strings. For instance, 0: "a a" where a is the big string repeated twice. I want this to be 0:"a", but whatever duplicated words are within a should not be removed. – code_learner Sep 30 '21 at 21:45
  • I've got it, thanks for explaining. There is another question. Do you have something separating each sentence from the next one? Like a blank space? I think you would like to remove them too right? – NickS1 Sep 30 '21 at 21:51
  • I think even the blank space can go as long as the lines are not repeated @NickS1 – code_learner Sep 30 '21 at 21:54
  • Sorry, @2e0byo has already solved it. I did not pay attention to the fact that each sentence ends with a period and a space. I'm really sorry haha – NickS1 Sep 30 '21 at 21:55

2 Answers

0

Since your dataframe is just storing strings, let's just do it manually:

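cleaned = []
for row in df["lines"]:
    seen = set()  # sentences already kept in this row
    keep = []
    for line in row.split(". "):
        line = line.strip()
        # Optional cleanup: drop leading markers such as "///" or "\\".
        line = line.strip("\\/-").strip()
        if not line:
            continue
        # split(". ") removes the separator, so restore the trailing period.
        if not line.endswith("."):
            line += "."
        if line not in seen:
            keep.append(line)
            seen.add(line)
    cleaned.append(" ".join(keep))
df["lines"] = cleaned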

We iterate over the column row by row and split each row on ". " (which splits it into sentences). Each sentence that hasn't been seen yet is stored in a list, and the row is then set to that list, joined back together with spaces.

Since the token we split on is removed, we append a "." to every sentence that doesn't end with one.
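
The same idea can also be written as a small function and applied to the column. This is only a sketch: `drop_repeated_sentences` is a hypothetical helper name, the column is assumed to be called "lines", and sentences are assumed to be separated by ". ". It reuses the order-preserving dict trick from the question, but on whole sentences instead of words.

```python
import pandas as pd


def drop_repeated_sentences(text: str) -> str:
    """Keep the first occurrence of each sentence, in order (hypothetical helper)."""
    # Split on ". " and normalise so every sentence ends with exactly one period.
    sentences = [s.strip().rstrip(".") + "." for s in text.split(". ") if s.strip()]
    # dict.fromkeys preserves insertion order, so whole sentences are
    # deduplicated while the words inside each sentence are left alone.
    return " ".join(dict.fromkeys(sentences))


# Small made-up frame in the shape of the question's data.
df = pd.DataFrame({"lines": [
    "/// User states the problem and but the problem stays. /// User states the problem and but the problem stays.",
    "\\\\ User describes the problem in the problem report.",
]})
df["cleaned"] = df["lines"].apply(drop_repeated_sentences)
```

Unlike the loop above, this keeps leading markers such as "///", since it does no extra cleanup.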

2e0byo
  • Problem is Series object does not have 'iterrows'. – code_learner Sep 30 '21 at 21:48
  • yes bother, sorry; updated. forgot about sequences. – 2e0byo Sep 30 '21 at 21:49
  • Somehow, it returns the same list of strings as before. Is it because of the list? Is there a way in regex to grab the first sentence ending with a "." and check whether that first sentence appears again in the big string, then remove everything from where it repeats to the end? – code_learner Sep 30 '21 at 22:07
  • @code_learner oh bother I should stop answering qs tonight and go to bed; there's *another* typo: shouldn't be `lines`, should be `keep`. – 2e0byo Sep 30 '21 at 22:21
  • 1
    Tested and it actually *works* now. Also added cleanup, but you might not want that. – 2e0byo Sep 30 '21 at 22:22
0

IIUC:

# Assumes the index is named 'Index', so reset_index() creates an 'Index' column.
out = (df['Lines'].str.findall(r'[^.]+').explode()
                  .reset_index().drop_duplicates()
                  .groupby('Index')['Lines']
                  .apply(lambda x: '.'.join(x)))

>>> out[0]
 /// User states this is causing a problem and but the problem can only be fixed by the user

>>> out[1]
 //- How to fix the problem is stated below. Below are the list of solutions to the problem

>>> print(out[2])
\\ User describes the problem in the problem report
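
If a row repeats a whole block of sentences, another option is the one raised in the question: take the first sentence and cut the string where that sentence starts a second time. The following is a rough sketch rather than a tested solution; `cut_at_first_repeat` is a hypothetical helper name, and it assumes the first sentence ends at the first ".".

```python
import re


def cut_at_first_repeat(text: str) -> str:
    """Truncate text where its first sentence starts a second time (hypothetical helper)."""
    text = text.strip()
    # The first sentence is everything up to and including the first period.
    match = re.match(r"[^.]*\.", text)
    if match is None:
        return text  # no period, so there is no first sentence to compare
    first = match.group(0)
    # Search for the same sentence again, starting after its first occurrence.
    repeat_at = text.find(first, len(first))
    return text if repeat_at == -1 else text[:repeat_at].rstrip()
```

It could be applied with `df['Lines'].apply(cut_at_first_repeat)`. Everything after the repeat is discarded, even if it differs from the first block, and because the search is a plain substring match, a very short first sentence could match inside a later, longer one.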
Corralien
  • @code_learner, can you check this possible solution? Let me know if something is wrong. – Corralien Sep 30 '21 at 22:18
  • Works well for this example. Was wondering what if there are multiple lines within a column element like "this is great. works well. this is great. works well.". It will end up as "this is great. works well. this is great." Instead of finding "." can we find the first string "this is great" and check within the big string if the first string appears again and remove everything after? – code_learner Sep 30 '21 at 22:31
  • For the following line, what is the expected result: "this is great. works well. this is great. works well enough." – Corralien Oct 01 '21 at 06:03
  • The expected result is supposed to be "this is great. works well. " as the two lines are repeated. – code_learner Oct 01 '21 at 12:04
  • "this is great. works well. this is great. works well enough." <<< enough. – Corralien Oct 01 '21 at 12:04