Partial extraction of unstructured data on 2nd instance

Question

I have a huge text file from EDGAR and want to extract only a portion of the text from the business risk section.

For example, if the text is:

Bshehebvegegeveghdhebejejrjbfbfk

and I want the extraction to start at the 2nd instance of "he" and end at the 2nd instance of "ge",

then my output would be: hebvegege

I would like the code in R. I am especially interested in the business risk section.

search for "regular expressions" and you will find examples of how to accomplish this — manotheshark, Feb 21 '17 at 16:44

akrun · Accepted Answer · 2017-02-21T16:57:19.857

One option is to use gregexpr to find the starting index of every match for the patterns 'he' and 'ge', take the second index of each, and then pass those to substr as the start and stop positions of the substring to extract.

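# start index of the 2nd occurrence of 'he'
i1 <- gregexpr("he", str1)[[1]][2]
# start index of the 2nd occurrence of 'ge', plus 1 to include its trailing 'e'
i2 <- gregexpr("ge", str1)[[1]][2] + 1
substr(str1, i1, i2)
#[1] "hebvegege"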

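Or in a single step, where the lookbehind pattern '(?<=g)e' matches the 'e' that follows a 'g', so its index is already the stop position: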

do.call(substr, c(str1, lapply(c("he", "(?<=g)e"), 
     function(pat) gregexpr(pat, str1, perl=TRUE)[[1]][2]) ))
#[1] "hebvegege"
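The same idea can be wrapped in a small function so that the occurrence numbers are not hard-coded. This is only a sketch: extract_between and its arguments are hypothetical names, not part of base R. It assumes literal (fixed) patterns, and it counts both occurrences from the start of the string, as the code above does. gregexpr returns -1 when a pattern is not found, and indexing past the last match gives NA, so the function returns NA in those cases instead of passing a bad index to substr.

```r
# Hypothetical helper: return the text running from the nth occurrence of
# start_pat through the last character of the mth occurrence of end_pat.
extract_between <- function(x, start_pat, end_pat, n = 2, m = 2) {
  s <- gregexpr(start_pat, x, fixed = TRUE)[[1]]
  e <- gregexpr(end_pat, x, fixed = TRUE)[[1]]
  # gregexpr returns -1 when there is no match at all
  if (s[1] == -1 || e[1] == -1 || length(s) < n || length(e) < m) {
    return(NA_character_)
  }
  # move from the first to the last character of the end pattern
  stop_pos <- e[m] + nchar(end_pat) - 1
  if (stop_pos < s[n]) return(NA_character_)
  substr(x, s[n], stop_pos)
}

str1 <- "Bshehebvegegeveghdhebejejrjbfbfk"
extract_between(str1, "he", "ge")
#[1] "hebvegege"
```

If either pattern occurs fewer times than requested, the result is NA_character_, which is easy to test for with is.na before further processing.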

data

str1 <- "Bshehebvegegeveghdhebejejrjbfbfk"
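For the filing itself, the same approach might look like the sketch below. Everything in it is an assumption made for illustration: the filing string is made up, and the heading texts "Item 1A. Risk Factors" and "Item 1B." are guesses at what marks the section of interest. The second instance is used on the assumption that each heading appears once in a table of contents and again in the body; real EDGAR text may differ in wording, case, and spacing.

```r
# Made-up stand-in for the filing text (assumption, not real EDGAR data)
filing <- paste(
  "TABLE OF CONTENTS Item 1. Business Item 1A. Risk Factors Item 1B. Unresolved Staff Comments",
  "Item 1. Business We make widgets.",
  "Item 1A. Risk Factors Demand for widgets may fall.",
  "Item 1B. Unresolved Staff Comments None."
)

# 2nd occurrence of the (assumed) section heading marks the start
start <- gregexpr("Item 1A. Risk Factors", filing, fixed = TRUE)[[1]][2]
# stop one character before the 2nd occurrence of the (assumed) next heading
end <- gregexpr("Item 1B.", filing, fixed = TRUE)[[1]][2] - 1

section <- trimws(substr(filing, start, end))
section
#[1] "Item 1A. Risk Factors Demand for widgets may fall."
```

With fixed = TRUE the headings are matched literally, so the periods are not treated as regex wildcards.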
