1

I have a character vector. For each of its elements, I am 100% sure there is a repetition, and it is always located at the start of the text.

A simplified example of a repeated sentence:

Hello. Hello. How are you?

What I aim for is just: Hello. How are you?

Another example:

Hello I am Joe. Hello I am Joe. How are you?

In this case I would aim for: Hello I am Joe. How are you?

Another example of repetition:

Hello I a Hello I am Joe. How are you?

Another example of repetition:

Hello I am Jo Hello I am Joe. How are you?

In these cases, the desired output is still: Hello I am Joe. How are you?

Another example is the following:

Hello I am J Hello I am Joe. Joe is indeed my name

In this case, the desired output is:

Hello I am Joe. Joe is indeed my name

Notice that all the repetition happens before the desired output, not in the middle and not at the end.

In my data, I am sure that each text is at least 440 characters long and that the repeated text at the beginning has a random length, 220 characters on average.

GiulioGCantone
  • 1
    Your last example breaks the bank, and it is not clear what the rules are for ending up with `Hello I am Joe.` – Tim Biegeleisen Dec 11 '22 at 10:40
  • 1
    @TimBiegeleisen not really, as the repetition is then only "Hello I am J"... if you remove that, the output is then as requested. (perhaps removing whitespace if needed) – giocomai Dec 11 '22 at 12:45

2 Answers

4

How about this?

library(stringr)
str_remove(string, "(.*)\\s(?=\\1)")
#> [1] "Hello. How are you?"                   "Hello I am Joe. Joe is indeed my name" "Hello I am Joe. How are you?"
#> [4] "Hello I am Joe. How are you?"          "Hello I am Joe. How are you?"          "Hello I am Joe. Joe is indeed my name"

How this works:

  • (.*): capture group matching anything
  • \\s: one whitespace
  • (?=\\1): positive lookahead asserting that what is captured in the capture group and 'remembered' by the backreference \\1 is getting repeated later in the string.
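
One caveat with the pattern as written: `(.*)` may capture an empty string, so on a text without any repetition the regex can still match a lone whitespace and delete it. A minimal sketch of a stricter variant follows, assuming the repetition always starts at the first character and is separated from its copy by a single whitespace. It uses base R's `sub()` with `perl = TRUE` instead of stringr, and the helper name `remove_leading_repeat_regex` is made up for illustration.

```r
# Hypothetical helper (not part of the answer): anchored variant of the regex.
remove_leading_repeat_regex <- function(x) {
  # ^       : only consider a fragment that starts at the beginning of the text
  # (.+)    : capture at least one character (greedy, longest candidate first)
  # \\s     : one whitespace between the fragment and its repetition
  # (?=\\1) : lookahead, the captured fragment must follow immediately
  sub("^(.+)\\s(?=\\1)", "", x, perl = TRUE)
}

string <- c("Hello. Hello. How are you?",
            "Hello I am J Hello I am Joe. Joe is indeed my name",
            "Hello I am Joe. Hello I am Joe. How are you?",
            "Hello I a Hello I am Joe. How are you?",
            "Hello I am Jo Hello I am Joe. How are you?")

remove_leading_repeat_regex(string)

# A text without a leading repetition is returned unchanged.
remove_leading_repeat_regex("How are you?")
```

Short accidental matches are still possible (a text starting with "the theory" would lose "the "), so with real data it may be worth requiring a longer capture, for example `(.{20,})`, if the repeated fragment is known to be long.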

Data (thanks to @giocomai):

string <- c("Hello. Hello. How are you?", 
            "Hello I am J Hello I am Joe. Joe is indeed my name",
            "Hello I am Joe. Hello I am Joe. How are you?",
            "Hello I a Hello I am Joe. How are you?",
            "Hello I am Jo Hello I am Joe. How are you?",
            "Hello I am J Hello I am Joe. Joe is indeed my name")
Chris Ruehlemann
0

If there are no other markers that show where the useful part of the text begins, I suppose something like the following might work. The idea is to truncate the original string more and more. If the truncated string is found more than once in the text, the script then checks whether it occurs twice in a row at the beginning of the text.

The script requires a minimum number of characters for the repetition (in principle it could be set to 1). If no repetition is found, it returns the original string.

It may require some tweaking for edge cases, but it works with all the examples provided.

string <- c("Hello. Hello. How are you?", 
            "Hello I am J Hello I am Joe. Joe is indeed my name",
            "Hello I am Joe. Hello I am Joe. How are you?",
            "Hello I a Hello I am Joe. How are you?",
            "Hello I am Jo Hello I am Joe. How are you?",
            "Hello I am J Hello I am Joe. Joe is indeed my name")

minimum_repetition_nchar <- 3 #assuming repetition must be of at least 3 characters

purrr::map_chr(
  .x = string,
  .f = function(current_string) {
    
    nchar_to_check <- nchar(current_string):minimum_repetition_nchar
    
    for (current_nchar in nchar_to_check) {
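      # Note: str_trunc() appends its default ellipsis "..." (counted within `width`);
      # because the result is used as a regex below, those dots match any three characters.
      truncated_string <- stringr::str_trunc(string = current_string,
                                             width = current_nchar)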
      
      n_matches <- stringr::str_count(string = current_string,
                                      pattern = truncated_string)
      
      if (n_matches>1) {
        if (stringr::str_starts(string = current_string, pattern = truncated_string)) {
          output <- stringr::str_remove(string = current_string,
                                        pattern = truncated_string)
          # check that repeated string is indeed at the beginning
          if (stringr::str_starts(string = output,
                                  pattern = truncated_string)) {
            return(output)
          }
        }
      } else {
        if (current_nchar==minimum_repetition_nchar) {
          return(current_string)
        }
      }
    }
  }
)
#> [1] "Hello. How are you?"                  
#> [2] "Hello I am Joe. Joe is indeed my name"
#> [3] "Hello I am Joe. How are you?"         
#> [4] "Hello I am Joe. How are you?"         
#> [5] "Hello I am Joe. How are you?"         
#> [6] "Hello I am Joe. Joe is indeed my name"
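
The script above passes the truncated string to stringr as a regular expression, so characters such as `?` or `(` in the text are treated as regex syntax. As an alternative, the same truncate-and-compare idea can be sketched in base R with literal matching only. This is a sketch under stated assumptions, not the answer's method: it assumes the repeated fragment covers at most half of the text and may be separated from its copy by optional whitespace. The names `remove_leading_repeat_fixed` and `min_len` are hypothetical.

```r
# Hypothetical helper: drop a leading fragment that is immediately repeated.
remove_leading_repeat_fixed <- function(x, min_len = 3L) {
  vapply(x, function(s) {
    n <- nchar(s)
    # Leave missing values and texts too short to hold a repetition untouched.
    if (is.na(s) || n < 2L * min_len) return(s)
    # Try the longest candidate prefix first, then shorter ones.
    for (k in seq(n %/% 2L, min_len)) {
      prefix <- substr(s, 1L, k)
      # Remainder of the text, without the whitespace that follows the prefix.
      rest <- sub("^\\s+", "", substr(s, k + 1L, n))
      # Literal comparison: no regex interpretation of the text itself.
      if (startsWith(rest, prefix)) return(rest)
    }
    s  # no repetition found: return the original text
  }, character(1), USE.NAMES = FALSE)
}

string <- c("Hello. Hello. How are you?",
            "Hello I am J Hello I am Joe. Joe is indeed my name",
            "Hello I am Joe. Hello I am Joe. How are you?",
            "Hello I a Hello I am Joe. How are you?",
            "Hello I am Jo Hello I am Joe. How are you?")

remove_leading_repeat_fixed(string)
```

Because the comparison uses `startsWith()`, punctuation in the text needs no escaping. The price is an explicit loop over candidate lengths for every element.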

Created on 2022-12-11 with reprex v2.0.2

giocomai