
I have a dataset of web-scraped reviews, and unfortunately they contain a lot of <br /> tags, so after I clean the data (remove stopwords etc.), a lot of stray "br" tokens remain in the dataset. I would like to remove these line-break remnants as well as some random alphanumeric strings (e.g. b00oex3) that make no sense in the text. After cleaning, this is an example:

 product b001e5dxao br train chocolate chai mix 12 ounce bags br br

I would like to turn this into

product train chocolate chai mix ounce bags.

I've tried


gsub("(<br />)"," ",text)

but it returns the following error

Error in gsub(., "(
)", " ", text) :
  assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634
In addition: Warning message:
In gsub(., "(
)", " ", text) :
  argument 'pattern' has length > 1 and only the first element will be used

Konrad Rudolph
TobiP
  • I don't get that error when I run your code. The error message appears to have a linebreak in it; maybe the pattern is being modified before being run? – user2554330 Apr 04 '23 at 10:07
  • I applied it to a whole dataset in a string of commands. Maybe this code only works for one sentence and not a whole frame – TobiP Apr 04 '23 at 10:44
  • The code you have shown us does not cause this error. You need to produce and post a [mcve]. – Konrad Rudolph Apr 04 '23 at 11:34
  • It looks like the regex is too long. `rex <- paste(do.call(paste0, expand.grid(letters, letters, letters)), collapse = "|"); gsub(rex, "", "abc")` will give your first error message. The second arises because your pattern vector has length greater than 1, e.g. `gsub(letters, "", "abc")` – GKi Apr 04 '23 at 12:00
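
Following the comments above, one possible way to avoid both messages is to pass `gsub()` a single, short pattern and let it run over the whole column at once. This is only a sketch under assumptions: the vector `patterns` and the sample `text` are hypothetical, and treating every token that contains a digit as noise is a guess based on the desired output in the question, not something the asker confirmed.

```r
# Hypothetical sample data; in the question this would be the review column.
text <- c(
  "product b001e5dxao br train chocolate chai mix 12 ounce bags br br",
  "great tea br would buy again"
)

# Things to strip, kept as a vector for readability (assumed patterns).
patterns <- c(
  "<br\\s*/?>",              # literal <br>, <br/> or <br /> tags
  "\\bbr\\b",                # stray "br" tokens left after cleaning
  "\\b(?=\\w*\\d)\\w+\\b"    # any token that contains at least one digit
)

# gsub() expects exactly one pattern, so join the alternatives with "|".
pattern <- paste(patterns, collapse = "|")
stopifnot(length(pattern) == 1)

# gsub() is vectorised over the text argument, so a whole column works.
# perl = TRUE is needed because the lookahead (?=...) is PCRE syntax.
cleaned <- gsub(pattern, " ", text, perl = TRUE)

# Collapse the runs of spaces left behind and trim both ends.
cleaned <- trimws(gsub("\\s+", " ", cleaned))
cleaned
```

For a data frame, the same call could be applied to one column, for example `df$review <- gsub(pattern, " ", df$review, perl = TRUE)`, where `df` and `review` are placeholder names.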

1 Answer


You could try working with read_html() and html_elements() from the rvest package to parse the HTML, which avoids ending up with HTML markup in the text in the first place.
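
A minimal sketch of that idea, assuming the raw review is still available as an HTML string. The sample `raw_review` and the `"p"` selector are made up for illustration, since the question does not show the structure of the scraped pages.

```r
library(rvest)

# Hypothetical raw review as scraped, still containing HTML markup.
raw_review <- "<p>Great chai mix.<br />Would buy again.</p>"

# Parse the string as HTML and select the paragraph element(s).
page <- read_html(raw_review)
paragraphs <- html_elements(page, "p")

# html_text2() extracts the text and treats <br> as a line break
# rather than leaving the tag name behind in the text.
review_text <- html_text2(paragraphs)

# Replace line breaks and repeated spaces with a single space.
review_text <- trimws(gsub("\\s+", " ", review_text))
review_text
```

Because the tags are interpreted by the parser instead of being stripped later as plain text, no "br" tokens should survive into the stopword-removal step.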

dufei