Remove Numbers, Punctuation, and White Space before Tokenization

Question

I have the following data frame:

report <- data.frame(Text = c("unit 1 crosses the street", 
       "driver 2 was speeding and saw driver# 1", 
        "year 2019 was the year before the pandemic",
        "hey saw       hei hei in        the    wood",
        "hello: my kityy! you are the best"), id = 1:5)
report 
                                         Text id
1                   unit 1 crosses the street  1
2     driver 2 was speeding and saw driver# 1  2
3  year 2019 was the year before the pandemic  3
4 hey saw       hei hei in        the    wood  4
5           hello: my kityy! you are the best  5

Following an earlier answer, we can remove stop words with the following code.

report$Text <- gsub(paste0('\\b',tm::stopwords("english"), '\\b', 
                          collapse = '|'), '', report$Text)
report
                                    Text id
1                 unit 1 crosses  street  1
2      driver 2  speeding  saw driver# 1  2
3            year 2019   year   pandemic  3
4 hey saw       hei hei             wood  4
5                 hello:  kityy!    best  5

The data above still contains noise (numbers, punctuation, and extra white space). I need to remove this noise before tokenization to get the data into the following format. Additionally, I want to remove selected extra stop words (for example, `saw` and `kityy`, as spelled in the data).

                                    Text id
1                   unit crosses  street  1
2                driver speeding  driver  2
3                     year year pandemic  3
4                       hey hei hei wood  4
5                             hello best  5

akrun · Accepted Answer · 2022-04-22T15:36:27.523

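We can take the union of tm::stopwords("english") and the new entries, paste them together with collapse = "|", and remove the matches by replacing them with "" in gsub. Punctuation and digits are removed first; afterwards, runs of whitespace (\\s+, one or more whitespace characters) are collapsed to a single space and the ends are trimmed with trimws.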

trimws(gsub("\\s+", " ", 
 gsub(paste0("\\b(", paste(union(c("saw", "kityy"), 
   tm::stopwords("english")), collapse="|"), ")\\b"), "", 
     gsub("[[:punct:]0-9]+", "", report$Text))
))

-output

[1] "unit crosses street" 
[2] "driver speeding driver" 
[3] "year year pandemic"   
[4] "hey hei hei wood"   
[5] "hello best"
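
For readers who find the nested calls hard to follow, the same three steps can be sketched one at a time. This is only an illustration, not part of the original answer: `clean_text` is a hypothetical helper name, and `base_stopwords` is a small hand-picked stand-in for tm::stopwords("english") so that the sketch runs without the tm package installed.

```r
# Assumption: a tiny stand-in for tm::stopwords("english"); with tm installed,
# use tm::stopwords("english") here instead.
base_stopwords  <- c("the", "was", "and", "in", "my", "you", "are", "before")
extra_stopwords <- c("saw", "kityy")

# Hypothetical helper: the answer's nested gsub calls, unrolled step by step.
clean_text <- function(x, stopwords) {
  # 1. Drop punctuation and digits.
  x <- gsub("[[:punct:]0-9]+", "", x)
  # 2. Drop whole-word matches of any stop word (\\b marks a word boundary).
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x)
  # 3. Collapse runs of whitespace to one space and trim both ends.
  trimws(gsub("\\s+", " ", x))
}

report <- data.frame(Text = c("unit 1 crosses the street",
                              "driver 2 was speeding and saw driver# 1",
                              "year 2019 was the year before the pandemic",
                              "hey saw       hei hei in        the    wood",
                              "hello: my kityy! you are the best"),
                     id = 1:5, stringsAsFactors = FALSE)

report$Text <- clean_text(report$Text, union(extra_stopwords, base_stopwords))
report$Text
```

The order matters: punctuation and digits go first so that tokens such as `kityy!` become plain words before the stop-word pattern is applied, and whitespace is collapsed last because the first two steps leave gaps behind.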

edited Apr 22 '22 at 15:36

answered Apr 22 '22 at 15:28

akrun


I need one more thing: we also want to remove words shorter than 4 characters. If the code is updated, I will update the question. Words such as `hey` and `hei` would then be removed. – S Das Apr 22 '22 at 15:38
@SDas Can you post this as a new question? Filtering by length with `nchar` requires some extra changes in the code. – akrun Apr 22 '22 at 15:39

@SDas You can do that. If `tmp` is the earlier output: `map_chr(str_extract_all(tmp, "\\w+"), ~ str_c(.x[str_length(.x) > 3], collapse = " "))`, which returns `"unit crosses street" "driver speeding driver" "year year pandemic" "wood" "hello best"` (`map_chr` is from purrr; the `str_*` functions are from stringr). – akrun Apr 22 '22 at 15:42
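
If purrr and stringr are not available, the length filter from the comment above can also be sketched in base R. This is an alternative illustration rather than the commenter's code; `drop_short_words` and its `min_chars` argument are hypothetical names, and `tmp` is assumed to hold the cleaned text from the accepted answer.

```r
# Assumption: tmp is the output of the accepted answer's cleaning step.
tmp <- c("unit crosses street", "driver speeding driver",
         "year year pandemic", "hey hei hei wood", "hello best")

# Hypothetical helper: keep only words with at least min_chars characters.
drop_short_words <- function(x, min_chars = 4) {
  words <- strsplit(x, "\\s+")  # one character vector of words per element
  vapply(words,
         function(w) paste(w[nchar(w) >= min_chars], collapse = " "),
         character(1))
}

drop_short_words(tmp)
```

An element whose words are all too short comes back as an empty string rather than being dropped, so the result stays aligned with the rows of the data frame.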
