Keep only the text of a label

Question

Given text that contains formatting tags, such as

data.frame(id = c(1, 2), text = c("something here <h1>my text</h1> also <h1>Keep it</h1>", "<h1>title</h1> another here"))

How can I keep only the text found inside <h1> </h1>, separating multiple matches with a comma?

data.frame(text = c("my text, Keep it", "title"), id = c(1, 2))

akrun · Answer 1 · 2020-08-04T23:52:16.430

3

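We can use str_extract_all. A regex lookbehind, (?<=<h1>), matches the characters that follow the opening tag, up to the next <. Because str_extract_all returns a list with one character vector per row, loop over that list with sapply and paste each row's matches together with collapse = ", ".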

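library(stringr)
# df1 is the data frame from the question
# (?<=<h1>) is a lookbehind for the opening tag; [^<]+ matches up to the next "<"
data.frame(text = sapply(str_extract_all(df1$text, "(?<=<h1>)[^<]+"),
      paste, collapse=", "), id = df1$id)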
#               text id
#1 my text, Keep it  1
#2            title  2
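
If a stringr dependency is not wanted, the same lookbehind pattern should also work in base R. This is only a minimal sketch, not part of the answer above: it assumes the question's data is stored in df1 with text as a character column, and the names m and out are illustrative.

```r
# Sample data from the question, with text kept as character (not factor)
df1 <- data.frame(
  id = c(1, 2),
  text = c("something here <h1>my text</h1> also <h1>Keep it</h1>",
           "<h1>title</h1> another here"),
  stringsAsFactors = FALSE
)

# perl = TRUE is needed because lookbehind is a PCRE feature;
# regmatches() returns a list with one character vector of matches per row
m <- regmatches(df1$text, gregexpr("(?<=<h1>)[^<]+", df1$text, perl = TRUE))

# Collapse each row's matches into a single comma-separated string
out <- data.frame(
  text = sapply(m, paste, collapse = ", "),
  id = df1$id,
  stringsAsFactors = FALSE
)
out
```

As with the stringr version, [^<]+ stops at the first <, so this assumes the h1 elements do not contain nested tags.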

edited Aug 04 '20 at 23:52

answered Aug 04 '20 at 18:07


Darren Tsai · Answer 2 · 2020-08-05T06:32:43.893

3

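You can treat this as a web-scraping task: parse each string as HTML with rvest, select the h1 nodes, and collapse their text with toString.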

library(rvest)

# df is the data frame from the question
sapply(df$text, function(x) {
  read_html(x) %>% html_nodes(css = "h1") %>% html_text() %>% toString()
}, USE.NAMES = FALSE)

# [1] "my text, Keep it"
# [2] "title"

edited Aug 05 '20 at 06:32

answered Aug 04 '20 at 18:15


score 1 · Answer 3 · answered Aug 10 '20 at 23:11

If you want to use quanteda for this, you can convert the data to a corpus and then process it via two corpus_segment() calls: one to get the text before </h1>, and a second to select just the text after <h1>. Then you can re-group the text using texts(x, groups = docid()), specifying spacer = ", ".

Here's how, with your desired output:

library("quanteda")
## Package version: 2.1.1

df <- data.frame(
  id = c(1, 2),
  text = c("something here <h1>my text</h1> also <h1>Keep it</h1>", "<h1>title</h1> another here")
)

charvec <- corpus(df, docid_field = "id") %>%
  corpus_segment("</h1>", pattern_position = "after") %>%
  corpus_segment("<h1>", pattern_position = "before") %>%
  texts(groups = docid(.), spacer = ", ")

Then to convert this into the data.frame that you want:

data.frame(text = charvec, id = names(charvec))
##               text id
## 1 my text, Keep it  1
## 2            title  2
