Text Extraction in R with stringi package

Question

I have the text below and need to extract specific words before and after a particular word

Example:

sometext <- "about us, close, products & services, focus, close, research & development, topics, carbon fiber reinforced thermoplastic, separators for lithium ion batteries, close, for investors, close, jobs & careers, close, \nselect language\n\n, home > corporate social responsibility > \nsocial report\n >  quality assurance\n, \nensuring provision of safe products, \nthe teijin group resin & plastic processing business unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold a major share in growing asian markets. these products are widely used in applications such as automotive components, office automation equipment and optical discs (blu-ray, dvd). customers include automotive manufacturers, electronic equipment manufacturers and related mold companies. customer data is organized into a database as groundwork to actively promote efforts to enhance customer satisfaction., \nin accordance with iso 9001 (8-4, 8-2), the regular implementation of"
library(stringi)
stri_extract_all_fixed(sometext , c('engineering plastics', 'iso 9001','office automation'), case_insensitive=TRUE, overlap=TRUE)

Actual output below

[[1]]
[1] "engineering plastics"

[[2]]
[1] "iso 9001"

[[3]]
[1] "office automation"

Required output:

[1] globally expanding its engineering plastics centered on polycarbonate resin
[2] accordance with iso 9001 (8-4, 8-2), the regular implementation of

Basically need to extract text before and after my specific words mentioned

Your call to `stri_extract_all_fixed` references a variable `prav_1` that is not defined. Please make your example reproducible. — drammock, Dec 28 '16 at 18:15
All text is before or after your specific words. You seem to want 3 words before "engineering plastics" and 4 words after; 2 words before "iso 9001" and quite a lot after... do you have a reliable logic you can explain about how much before and after you want to extract? — Gregor Thomas, Dec 28 '16 at 19:30

score 0 · Answer 1 · answered Feb 17 '17 at 21:42

This is some idea to start with:

sometext <- "about us, close, products & services, focus, close, research & development, topics, carbon fiber reinforced thermoplastic, separators for lithium ion batteries, close, for investors, close, jobs & careers, close, \nselect language\n\n, home > corporate social responsibility > \nsocial report\n >  quality assurance\n, \nensuring provision of safe products, \nthe teijin group resin & plastic processing business unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold a major share in growing asian markets. these products are widely used in applications such as automotive components, office automation equipment and optical discs (blu-ray, dvd). customers include automotive manufacturers, electronic equipment manufacturers and related mold companies. customer data is organized into a database as groundwork to actively promote efforts to enhance customer satisfaction., \nin accordance with iso 9001 (8-4, 8-2), the regular implementation of"
library(stringi)
words <- c('engineering plastics', 'iso 9001','office automation')
pattern <- stri_paste("([^ ]+ ){0,10}", words, "([^ ]+ ){0,10}")
stri_extract_all_regex(sometext , pattern, case_insensitive=TRUE, overlap=TRUE)

Explanation: I'm adding simple regex before and after your desired words:

"([^ ]+ ){0,10}"

which means:

anything but space, repeated as many times as you can
then space
and all of this up to ten times

This is very simple and naive (eg it treats all the '&' or '>' as words) but works.

Text Extraction in R with stringi package

1 Answers1