How to remove HTML line breaks?

Question

I have a dataset of web-scraped reviews that unfortunately contain a lot of <br /> tags, so after I clean the data (remove stopwords, etc.), many stray "br" tokens remain in the dataset. I would like to remove these line breaks as well as some random alphanumeric strings (e.g. b00oex3) that make no sense in the text. After cleaning, this is an example:

 product b001e5dxao br train chocolate chai mix 12 ounce bags br br

I would like to turn this into

product train chocolate chai mix ounce bags.

I've tried


gsub("(<br />)"," ",text)

but it returns the following error

Error in gsub(., "(
)", " ", text) : assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634
In addition: Warning message:
In gsub(., "(
)", " ", text) : argument 'pattern' has length > 1 and only the first element will be used

I don't get that error when I run your code. The error message appears to have a linebreak in it; maybe the pattern is being modified before being run? — user2554330, Apr 04 '23 at 10:07
I applied it to a whole dataset in a chain of commands. Maybe this code only works for a single sentence and not a whole data frame. — TobiP, Apr 04 '23 at 10:44
The code you have shown us does not cause this error. You need to produce and post a [mcve]. — Konrad Rudolph, Apr 04 '23 at 11:34
It looks like the regex is too long. `rex <- paste(do.call(paste0, expand.grid(letters, letters, letters)), collapse = "|"); gsub(rex, "", "abc")` will give your first error message. The second comes from your pattern vector having length greater than 1, e.g. `gsub(letters, "", "abc")`. — GKi, Apr 04 '23 at 12:00
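
To illustrate the point made in the comments, here is a minimal base R sketch, not the asker's actual pipeline. It assumes the reviews sit in a character column; the data frame `reviews`, the column `text`, and the helper `clean_review()` are hypothetical names. Each `gsub()` call gets one pattern string of length 1, while the input can be a whole column.

```r
# Hypothetical sample data standing in for the scraped reviews.
reviews <- data.frame(
  text = c(
    "Great chai mix<br />Arrived fast<br /><br />",
    "product b001e5dxao br train chocolate chai mix 12 ounce bags br br"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: every pattern is a single string (length 1),
# but x may be an entire character vector or data frame column.
clean_review <- function(x) {
  # Replace <br>, <br/> and <br /> tags with a space.
  x <- gsub("<br\\s*/?>", " ", x, perl = TRUE)
  # Remove stand-alone "br" tokens left over from earlier cleaning.
  x <- gsub("\\bbr\\b", " ", x, perl = TRUE)
  # Remove every token that contains at least one digit (e.g. b001e5dxao, 12).
  x <- gsub("\\b\\w*\\d\\w*\\b", " ", x, perl = TRUE)
  # Collapse runs of whitespace and trim both ends.
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

reviews$clean <- clean_review(reviews$text)
print(reviews$clean)
```

Note that the digit rule is deliberately blunt: it also drops plain numbers such as 12, which matches the desired output in the question but may be too aggressive for other data.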

score 0 · Answer 1 · answered Apr 04 '23 at 11:26

You could try working with read_html() and html_elements() from the rvest package to parse the html and avoid ending up with html markup in the first place.
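
A rough sketch of what that could look like, assuming the rvest package is installed. The HTML fragment and the `.review` CSS selector are made up for illustration and would need to be adapted to the real page. `html_text2()` turns `<br>` elements into line breaks, which can then be collapsed to spaces.

```r
library(rvest)

# Hypothetical HTML fragment standing in for one scraped page.
page <- read_html(
  '<div class="review">Great chai mix<br />Arrived fast<br /><br /></div>'
)

# ".review" is an assumed selector; use whatever matches the real markup.
nodes <- html_elements(page, ".review")

# html_text2() renders <br> as a line break instead of leaving markup behind.
review_text <- html_text2(nodes)

# Collapse line breaks and repeated whitespace into single spaces.
flat <- trimws(gsub("\\s+", " ", review_text))
print(flat)
```

Because the tags are handled at parse time, no "br" tokens should reach the later stopword-removal step.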

dufei