I have an uncleaned character vector, and I want to remove from each element every item that does not appear in another character vector. I know what I want to keep, but not exactly what to remove, which makes gsub() and str_replace_all() hard to use.

The character vector I want to clean is issue_uncleaned, and it looks like this (not the complete version):

[1] "Facebook Fact-checks; Coronavirus; TikTok posts "                                            
[2] "Facebook Fact-checks; Facebook posts "                                                       
[3] "Facebook Fact-checks; Coronavirus; Bloggers "                                                
[4] "Facebook Fact-checks; Facebook posts "                                                       
[5] "National; Criminal Justice; Crime; Facebook Fact-checks; Facebook posts "    

The character vector I want to use as a filter is 151_issues, and it looks like this (not the complete version):

[1] "Facebook Fact-checks"         "Coronavirus"       "Crime"

My desired result (it would be even better if the ; at the beginning or end could also be removed):

[1] "Facebook Fact-checks; Coronavirus;  "                                            
[2] "Facebook Fact-checks;  "                                                       
[3] "Facebook Fact-checks; Coronavirus;  "                                                
[4] "Facebook Fact-checks;  "                                                       
[5] "; ; Crime; Facebook Fact-checks;  "  

Many thanks for your help!

Yaxin Dai
  • Can you please provide a character vector that represents your desired result? – Shawn Brar Feb 20 '22 at 08:11
  • You haven't really named your issues `151_issues`, have you? Object names starting with a number are discouraged; it's also better to use only characters or underscores in object names. – jay.sf Feb 20 '22 at 08:30
  • Thank you for your suggestion! I was unaware of that. – Yaxin Dai Feb 20 '22 at 08:33

2 Answers

Use strsplit to split each element, intersect to keep only the wanted items, and paste to join them again.

sapply(lapply(strsplit(v, '; '), intersect, issues), paste, collapse='; ')
# [1] "Facebook Fact-checks; Coronavirus" "Facebook Fact-checks"             
# [3] "Facebook Fact-checks; Coronavirus" "Facebook Fact-checks"             
# [5] "Facebook Fact-checks"      

Data:

v <- c("Facebook Fact-checks; Coronavirus; TikTok posts", "Facebook Fact-checks; Facebook posts", 
"Facebook Fact-checks; Coronavirus; Bloggers", "Facebook Fact-checks; Facebook posts", 
"National; Criminal Justice; Crime; Facebook Fact-checks; Facebook posts"
)
issues <- c("Facebook Fact-checks", "After the Fact", "Animals", "Bankruptcy", 
"Border Security", "Ad Watch", "Agriculture", "Ask PolitiFact", 
"Baseball", "Bush Administration", "Afghanistan", "Alcohol", 
"Autism", "Bipartisanship", "Coronavirus")
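
The one-liner above can be unpacked into separate steps. What follows is only a sketch on a shortened copy of the asker's data, and the names v_demo, keep, parts, kept and cleaned are made up for illustration. It assumes the elements may carry trailing spaces, as in the question, so each piece is trimmed with trimws before matching. Because the pieces are rejoined from the kept items only, no leading or trailing ";" is left over.

```r
# Hypothetical, shortened inputs modelled on the question (note the trailing spaces)
v_demo <- c(
  "Facebook Fact-checks; Coronavirus; TikTok posts ",
  "National; Criminal Justice; Crime; Facebook Fact-checks; Facebook posts "
)
keep <- c("Facebook Fact-checks", "Coronavirus", "Crime")

# 1. Split every element on ";" (fixed = TRUE treats ";" as a literal string)
parts <- strsplit(v_demo, ";", fixed = TRUE)

# 2. Strip leading and trailing white space from each piece
parts <- lapply(parts, trimws)

# 3. Keep only the pieces that also occur in `keep`
#    (intersect preserves the order of its first argument and drops duplicates)
kept <- lapply(parts, intersect, keep)

# 4. Join the kept pieces back into one string per element;
#    vapply guarantees a character vector with one string per input
cleaned <- vapply(kept, paste, character(1), collapse = "; ")

print(cleaned)
```

Note that intersect also removes repeated items within an element; if duplicates should be preserved, subsetting with %in% is the alternative.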
jay.sf
  • Thank you so much!! It works perfectly. Sorry I misunderstood it at first. I only started programming this month, so sorry about my silly mistakes. I have learned a lot from your answer~ – Yaxin Dai Feb 20 '22 at 09:15
issue_uncleaned <- c("Facebook Fact-checks; Coronavirus; TikTok posts ", "Facebook Fact-checks; Facebook posts ",
  "Facebook Fact-checks; Coronavirus; Bloggers ", "Facebook Fact-checks; Facebook posts ",
  "National; Criminal Justice; Crime; Facebook Fact-checks; Facebook posts ")
issues_151 <- c("Facebook Fact-checks", "Coronavirus", "Crime")
k <- strsplit(issue_uncleaned, "; ") # splits each element into its individual items
k <- lapply(k, trimws) # removes the white space at the end or beginning of each item
# keeps only the wanted items; lapply always returns a list, even when all lengths match
k2 <- lapply(k, function(x) x[x %in% issues_151])
issue_cleaned <- sapply(k2, paste, collapse = "; ")
Shawn Brar