I have a dataset of web-scraped reviews, and unfortunately they contain a lot of <br /> tags.
After I clean the data (remove stopwords etc.), many standalone "br" tokens remain in the dataset.
I would like to remove these line-break remnants as well as some random alphanumeric tokens (e.g. b00oex3) that make no sense in the text. After cleaning, an example looks like this:
product b001e5dxao br train chocolate chai mix 12 ounce bags br br
I would like to turn this into
product train chocolate chai mix ounce bags.
I've tried
gsub("(<br />)"," ",text)
but it returns the following error
Error in gsub(., "(<br />)", " ", text) : assertion 'tree->num_tags == num_tags' failed in executing regexp: file 'tre-compile.c', line 634
In addition: Warning message:
In gsub(., "(<br />)", " ", text) : argument 'pattern' has length > 1 and only the first element will be used
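For reference, here is a minimal sketch of the cleanup described above, using only base R. The function name clean_review is hypothetical, and the sketch assumes the reviews are held in a character vector and that every token containing a digit (such as b001e5dxao or 12) should be dropped. Note that gsub() expects its arguments in the order pattern, replacement, x; the gsub(., ...) in the error message suggests the data was piped into the pattern position, which would explain the "argument 'pattern' has length > 1" warning.

```r
# Hypothetical helper: strips <br /> tags, leftover "br" tokens,
# and any token that contains a digit, then tidies whitespace.
clean_review <- function(x) {
  # Remove raw HTML line-break tags such as <br>, <br/> or <br />
  x <- gsub("<br\\s*/?>", " ", x, ignore.case = TRUE, perl = TRUE)
  # Remove standalone "br" tokens left behind by earlier cleaning
  x <- gsub("\\bbr\\b", " ", x, perl = TRUE)
  # Remove whole tokens that contain at least one digit
  x <- gsub("\\b\\w*\\d\\w*\\b", " ", x, perl = TRUE)
  # Collapse runs of whitespace and trim both ends
  trimws(gsub("\\s+", " ", x))
}

text <- "product b001e5dxao br train chocolate chai mix 12 ounce bags br br"
clean_review(text)
# [1] "product train chocolate chai mix ounce bags"
```

Because gsub() is vectorised over x, the same call should work on a whole column, for example clean_review(df$review), where df and review are placeholder names.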