
I have the text below and need to extract a number of words before and after particular phrases.

Example:

sometext <- "about us, close, products & services, focus, close, research & development, topics, carbon fiber reinforced thermoplastic, separators for lithium ion batteries, close, for investors, close, jobs & careers, close, \nselect language\n\n, home > corporate social responsibility > \nsocial report\n >  quality assurance\n, \nensuring provision of safe products, \nthe teijin group resin & plastic processing business unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold a major share in growing asian markets. these products are widely used in applications such as automotive components, office automation equipment and optical discs (blu-ray, dvd). customers include automotive manufacturers, electronic equipment manufacturers and related mold companies. customer data is organized into a database as groundwork to actively promote efforts to enhance customer satisfaction., \nin accordance with iso 9001 (8-4, 8-2), the regular implementation of"
library(stringi)
stri_extract_all_fixed(sometext , c('engineering plastics', 'iso 9001','office automation'), case_insensitive=TRUE, overlap=TRUE)

Actual output:

[[1]]
[1] "engineering plastics"

[[2]]
[1] "iso 9001"

[[3]]
[1] "office automation"

Required output:

[1] globally expanding its engineering plastics centered on polycarbonate resin
[2] accordance with iso 9001 (8-4, 8-2), the regular implementation of

Basically, I need to extract the text immediately before and after each of the phrases I specify.

PRAVEEN R
  • Your call to `stri_extract_all_fixed` references a variable `prav_1` that is not defined. Please make your example reproducible. – drammock Dec 28 '16 at 18:15
  • All text is before or after your specific words. You seem to want 3 words before "engineering plastics" and 4 words after; 2 words before "iso 9001" and quite a lot after... do you have a reliable logic you can explain about how much before and after you want to extract? – Gregor Thomas Dec 28 '16 at 19:30
  • Please read prav_1 as sometext. – PRAVEEN R Dec 29 '16 at 02:12
  • I need 10 words before and 10 words after. – PRAVEEN R Dec 29 '16 at 02:13

1 Answer


Here is an idea to start with:

sometext <- "about us, close, products & services, focus, close, research & development, topics, carbon fiber reinforced thermoplastic, separators for lithium ion batteries, close, for investors, close, jobs & careers, close, \nselect language\n\n, home > corporate social responsibility > \nsocial report\n >  quality assurance\n, \nensuring provision of safe products, \nthe teijin group resin & plastic processing business unit is globally expanding its engineering plastics centered on polycarbonate resin, where we hold a major share in growing asian markets. these products are widely used in applications such as automotive components, office automation equipment and optical discs (blu-ray, dvd). customers include automotive manufacturers, electronic equipment manufacturers and related mold companies. customer data is organized into a database as groundwork to actively promote efforts to enhance customer satisfaction., \nin accordance with iso 9001 (8-4, 8-2), the regular implementation of"
library(stringi)
words <- c('engineering plastics', 'iso 9001','office automation')
# Up to 10 space-delimited tokens before each phrase and up to 10 after it.
# The trailing group starts with the space, because a space directly follows the phrase.
pattern <- stri_paste("([^ ]+ ){0,10}", words, "( [^ ]+){0,10}")
stri_extract_all_regex(sometext, pattern, case_insensitive=TRUE)

Explanation: I'm adding a simple regex before your desired words:

"([^ ]+ ){0,10}"

and its mirror image, "( [^ ]+){0,10}", after them. The first one means:

  1. anything but a space, repeated as many times as possible
  2. then a space
  3. and all of this up to ten times
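
As a small illustration of the three steps above, here is the leading group applied to a made-up five-word string, with the limit lowered from ten tokens to two so the effect is easy to see. The string and the variable name `demo` are only for this example.

```r
library(stringi)

# "([^ ]+ ){0,2}" = up to two "non-spaces followed by one space" tokens,
# which must be immediately followed by the literal word "four".
demo <- stri_extract_first_regex("one two three four five", "([^ ]+ ){0,2}four")
print(demo)
```

The match starts at "two" rather than "one", because only two tokens are allowed in front of "four".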

This is very simple and naive (e.g. it treats every '&' or '>' as a word), but it works.
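
If the text contains newlines, tabs or repeated spaces, one possible refinement is to split on any whitespace instead of a single space. The sketch below is only an assumption about how that could look: `extract_context` is a hypothetical helper, not a stringi function, and it wraps each phrase in `\Q...\E` so that regex metacharacters in a phrase are taken literally (this assumes the phrase does not itself contain `\E`).

```r
library(stringi)

# Hypothetical helper: return each phrase together with up to `n`
# whitespace-delimited tokens on either side of it.
extract_context <- function(text, phrases, n = 10) {
  # Quote the phrases so characters such as "(" or "." are matched literally.
  quoted <- stri_paste("\\Q", phrases, "\\E")
  pattern <- stri_paste(
    "(?:\\S+\\s+){0,", n, "}",  # up to n tokens before the phrase
    quoted,
    "(?:\\s+\\S+){0,", n, "}"   # up to n tokens after the phrase
  )
  # Vectorised over `pattern`: one list element per phrase.
  stri_extract_all_regex(text, pattern, case_insensitive = TRUE)
}

# Made-up input, with n = 2 to keep the result short.
txt <- "one two three TARGET WORD four five six"
res <- extract_context(txt, "target word", n = 2)
print(res)
```

With the data from the question it would be called as `extract_context(sometext, words)`. Punctuation such as '&' or '>' still counts as a token here; only the handling of whitespace changes.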

bartektartanus