
I have a huge text file from EDGAR. I want to extract only a portion of the text from the business risk section.

For example, if the text is:

Bshehebvegegeveghdhebejejrjbfbfk

I want to extract the text that starts at the second instance of "he" and ends at the second instance of "ge".

So my output will be: hebvegege

I would like code in R, and I am especially interested in the business risk section.

Sandipan Dey
user35655

1 Answer


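One option is to use gregexpr to find the starting index of every match of the patterns 'he' and 'ge', take the second match of each, and pass those indices to substr as the start and stop positions. Because gregexpr reports where a match begins, 1 is added to the 'ge' index so that the extracted substring includes the trailing 'e'.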

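# position of the second "he" (start of the substring)
i1 <- gregexpr("he", str1)[[1]][2]
# position of the second "ge", plus 1 so the final "e" is included
i2 <- gregexpr("ge", str1)[[1]][2] + 1
substr(str1, i1, i2)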
#[1] "hebvegege"

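Or in a single step, using a lookbehind so that the second pattern matches the 'e' that follows a 'g' and no + 1 adjustment is needed: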

do.call(substr, c(str1, lapply(c("he", "(?<=g)e"),
     function(pat) gregexpr(pat, str1, perl = TRUE)[[1]][2])))
#[1] "hebvegege"
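
The same idea can be wrapped in a small function so that the occurrence numbers are not hard-coded. The following is only a sketch: the name extract_between and its arguments are hypothetical, it treats both patterns as literal strings (fixed = TRUE), and it returns NA when either occurrence is missing or the end falls before the start.

```r
# Hypothetical helper (not part of base R): extract from the n_start-th
# occurrence of start_pat through the end of the n_end-th occurrence of end_pat.
extract_between <- function(x, start_pat, end_pat, n_start = 2, n_end = 2) {
  s <- gregexpr(start_pat, x, fixed = TRUE)[[1]]
  e <- gregexpr(end_pat, x, fixed = TRUE)[[1]]
  # gregexpr returns -1 when nothing matches
  if (s[1] == -1 || e[1] == -1 || length(s) < n_start || length(e) < n_end) {
    return(NA_character_)
  }
  first <- s[n_start]
  # move from the start of the end pattern to its last character
  last <- e[n_end] + nchar(end_pat) - 1
  if (last < first) return(NA_character_)
  substr(x, first, last)
}

str1 <- "Bshehebvegegeveghdhebejejrjbfbfk"
extract_between(str1, "he", "ge")
#[1] "hebvegege"
```

Using nchar(end_pat) instead of a literal + 1 keeps the function correct for end patterns longer than two characters.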

data

str1 <- "Bshehebvegegeveghdhebejejrjbfbfk"
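
For the business risk section itself, the same gregexpr/substr approach might be applied to section headings rather than two-letter patterns. The sketch below is built on assumptions that are not in the question: it uses a made-up toy string in place of the real file, it assumes the section starts at a heading containing "Item 1A" and ends just before one containing "Item 1B", and it assumes the first "Item 1A" is a table-of-contents entry. The actual headings in a given filing may differ.

```r
# Toy stand-in for a filing; the text and headings are assumptions for illustration.
# For a real file, something like paste(readLines(path), collapse = "\n")
# could build this string, where path is the location of the file.
filing <- paste(
  "Table of contents: Item 1A. Risk Factors ... Item 1B. Unresolved Staff Comments",
  "Item 1A. Risk Factors",
  "Our business depends on a small number of customers.",
  "Item 1B. Unresolved Staff Comments",
  "None.",
  sep = "\n"
)

starts <- gregexpr("Item 1A", filing, fixed = TRUE)[[1]]
ends   <- gregexpr("Item 1B", filing, fixed = TRUE)[[1]]

# Second "Item 1A" (the first is the table-of-contents entry in this toy text),
# up to the character just before the first "Item 1B" that follows it.
first <- starts[2]
last  <- ends[ends > first][1] - 1
risk_section <- trimws(substr(filing, first, last))
cat(risk_section)
```

Choosing the end marker as the first "Item 1B" after the start, rather than a fixed occurrence number, avoids picking up an earlier mention of the heading.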
akrun