
I have a text like this:

Insanely good Insanely good music. Kanye West is GOAT. The sky is blue.

I want a function that removes the first sequence of words in a string if that sequence is immediately repeated.

In the case above, the text would become:

Insanely good music. Kanye West is GOAT. The sky is blue.

I only want to remove the first repetition, not every repetition.

I remember that stringr or stringi has a function that does exactly this, but I do not remember which one.

GiulioGCantone

1 Answer


Here is a regex-based solution using gsub:

x <- "Insanely good Insanely good music. Kanye West is GOAT. The sky is blue."
output <- gsub("\\b\\s*(\\w+)\\s*\\b(?=[^.]*\\b\\1\\b)", " ", x, perl=TRUE)
output <- gsub("^\\s+|\\s+$", "", output)
output

[1] "Insanely good music. Kanye West is GOAT. The sky is blue."

The first substitution matches every word that appears again later in the same sentence (the lookahead `[^.]*` cannot cross a period) and replaces it with a single space. The second call to gsub trims the leading and trailing whitespace that this leaves behind.
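Because the substitution above removes any word that recurs within a sentence, not just a repeated opening, a narrower variant may be closer to what the question asks for. The following is only a sketch and is not part of the original answer: it assumes the repeated sequence sits at the very start of the string and that the two copies are separated only by whitespace. The function name `remove_leading_repeat` is hypothetical.

```r
# Sketch: drop the second copy of a sequence that is repeated
# at the very start of the string. Base R only, PCRE syntax.
remove_leading_repeat <- function(x) {
  # ^(.+?)   shortest prefix that ...
  # \\s+\\1  ... is followed by whitespace and an exact copy of itself
  # (?!\\w)  the copy must not run into a longer word (e.g. "a apple")
  sub("^(.+?)\\s+\\1(?!\\w)", "\\1", x, perl = TRUE)
}

x1 <- "Insanely good Insanely good music. Kanye West is GOAT. The sky is blue."
x2 <- "Insanely good music. Insanely good music. Kanye West is GOAT. The sky is blue."

remove_leading_repeat(x1)
remove_leading_repeat(x2)
```

Both calls should return "Insanely good music. Kanye West is GOAT. The sky is blue.", since the back-reference compares the prefix literally and so does not need to know where sentences end. A string without a repeated opening is returned unchanged.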

Tim Biegeleisen
  • Why does this work in the example above but not with `Insanely good music. Insanely good music. Kanye West is GOAT. The sky is blue.`? It seems that dots (and possibly other special characters) are an issue. – GiulioGCantone Jun 05 '22 at 19:50
  • Well, we would need some way to identify individual sentences, using regex or any other method. – Tim Biegeleisen Jun 05 '22 at 23:22
  • Don't you think it's just a problem of recognizing punctuation? I could remove punctuation from the text, but I wonder if there is a way to preserve it. – GiulioGCantone Jun 06 '22 at 06:25
  • The issue is how to distinguish a period that marks the end of a sentence from a period in some other context (e.g. an abbreviation). In general, you would need an NLP grammar library to handle this. – Tim Biegeleisen Jun 06 '22 at 06:34