Removing non-alphanumeric characters from an ordered collection of objects (list) in R

Question

I have a question about removing non-alphanumeric characters from a list in R. I have a list with all sorts of odd characters, blanks, etc. and would like to remove them. I'm generally able to remove what I want using the tm package in R. I fiddled around with it but got nowhere, so I thought going back to the list may be the place to start.

The list:

 list("\n    \n", "\n\n  ", "\n        ", "               ", "\n    ", 
 "\n            \n      ", "\n        ", "Home", "\n", "Expertise", 
 "Question & Research Design", "\n", "Survey Development & Validation", 
 "\n", "Data Processing", "\n", "Statistical Analysis", "\n", 
 "Publications & Grants", "\n", "Evaluation", "\n", "\n", 
 "Consulting Areas", "Business", "\n", "Education", "K-12", 
 "\n", "Â ", " Â Â  Â  Â", " | ")

The expected output

[1] ""                               ""                         ""
[4] ""                               ""                         ""
[7] ""                               "Home"                     ""
[10] "Expertise"                     "Question Research Design" ""
[13] "Survey Development Validation" ""                         "Data Processing"
[16] ""                              "Statistical Analysis"     ""
[19] "Publications Grants"           ""                         "Evaluation"
[22] ""                              ""                         "Consulting Areas"
[25] "Business"                      ""                         "Education"
[28] "K12"                           ""                         ""
[31] ""                              ""

Do you want to remove the weird symbols and blanks and whatnot from the character strings, or do you want to remove any string that contains weird characters from the list? It would be best if you provided your expected output for your example data. – Dason, Jun 20 '12 at 22:37

Tim P · Accepted Answer · 2012-06-21T02:02:55.123

5

Strongly recommend you simply use

gsub("[^a-zA-Z0-9]","",x)

where x is the name of the list.

You probably included the foreign characters at the end of the list because you want these removed too; the command above does that. To explain briefly: the square brackets define a set of characters, and a leading ^ inside the brackets means "not", so every character that is not one of the 62 specified characters (lowercase a to z, uppercase A to Z, and digits 0 to 9) is replaced by the empty string "" (i.e. deleted).

And here's the output...

 [1] ""                             ""                        ""
 [4] ""                             ""                        ""
 [7] ""                             "Home"                    ""
[10] "Expertise"                    "QuestionResearchDesign"  ""
[13] "SurveyDevelopmentValidation"  ""                        "DataProcessing"
[16] ""                             "StatisticalAnalysis"     ""
[19] "PublicationsGrants"           ""                        "Evaluation"
[22] ""                             ""                        "ConsultingAreas"
[25] "Business"                     ""                        "Education"
[28] "K12"                          ""                        ""
[31] ""                             ""
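
As a quick check of the negated character class described above, here is a minimal sketch on a handful of the question's strings. The variable names `cleaned` and `cleaned_list` are illustrative and not part of the original answer. Note that `gsub` coerces a list to a character vector, so `lapply` is shown as one assumed way to keep the list structure if that matters.

```r
# A few entries taken from the question's list.
x <- list("\n    \n", "Home", "Question & Research Design", "K-12", " | ")

# gsub() coerces the list with as.character(), so the result is a
# character vector rather than a list.
cleaned <- gsub("[^a-zA-Z0-9]", "", x)
print(cleaned)
# values: "", "Home", "QuestionResearchDesign", "K12", ""

# If the list structure matters, apply the same pattern element-wise.
cleaned_list <- lapply(x, function(s) gsub("[^a-zA-Z0-9]", "", s))
```

Because the pattern is a whitelist, anything not explicitly listed is removed, which is why the spaces between words disappear as well.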

edited Jun 21 '12 at 02:02

answered Jun 21 '12 at 01:50


`gsub(" +"," ",gsub("^ +","",gsub("[^a-zA-Z0-9 ]","",x)))` gives neat output with spaces included, including wiping out fields that consist solely of spaces and consolidating consecutive spaces into a single space. – Tim P Jun 21 '12 at 02:10
In my previous comment, the inner `gsub` is simply the command given in my solution with spaces included in the set of allowed characters; the next gsub out `gsub("^ +","",...)` wipes out initial spaces (which obviously eliminates strings consisting only of spaces); and the outermost command `gsub(" +"," ",...)` replaces each occurrence of one or more spaces with a single space. – Tim P Jun 21 '12 at 02:14
That solved the problem of getting rid of the characters. I ran it through the rest of my code and it worked out nicely. Thanks! – Tom Jun 21 '12 at 02:14
Excellent news! Let me know if you have any other issues getting it up and running, otherwise I await your juicy green tick ;) – Tim P Jun 21 '12 at 02:15
That's wonderful use of regex in gsub. Now I know how to apply regex in R – user3670684 Jan 23 '15 at 07:37
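
The nested `gsub` calls from Tim P's comments can be easier to read when unrolled into named steps. The sketch below does that; the helper name `clean_text` and the intermediate variable names are hypothetical, and the three patterns are the ones quoted in the comments.

```r
# Hypothetical helper wrapping the three gsub() calls from the comments.
clean_text <- function(x) {
  # 1. Keep only letters, digits and spaces.
  kept <- gsub("[^a-zA-Z0-9 ]", "", x)
  # 2. Strip leading spaces; strings made only of spaces become "".
  trimmed <- gsub("^ +", "", kept)
  # 3. Collapse each run of spaces into a single space.
  gsub(" +", " ", trimmed)
}

x <- list("\n    ", "Question & Research Design", "K-12", " | ")
result <- clean_text(x)
print(result)
# values: "", "Question Research Design", "K12", ""
```

One limitation of this sketch: a trailing space is reduced to a single space but not removed. If that matters, base R's `trimws()` could be applied to the result.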

score 0 · Answer 2 · answered Jun 20 '12 at 22:37

I'm not sure if this gets rid of everything you want to remove, but ?regexp describes all sorts of interesting broad classes you can use. For what you're describing, I think you want:

 gsub('[[:space:][:punct:]]+', '', yourlist)

Which gives:

 [1] ""                            ""                            ""                            ""                           
 [5] ""                            ""                            ""                            "Home"                       
 [9] ""                            "Expertise"                   "QuestionResearchDesign"      ""                           
[13] "SurveyDevelopmentValidation" ""                            "DataProcessing"              ""                           
[17] "StatisticalAnalysis"         ""                            "PublicationsGrants"          ""                           
[21] "Evaluation"                  ""                            ""                            "ConsultingAreas"            
[25] "Business"                    ""                            "Education"                   "K12"                        
[29] ""                            "Â"                           "ÂÂÂÂ"                        ""

That is pretty much what I'm looking for, I just added the spaces back between the words. I did get to this point using another less efficient method so this works much better, but I'm still struggling trying to get rid of other odd characters like the A-hat. — Tom, Jun 21 '12 at 01:56
Seen my solution below which handles the foreign characters like the A-hat? Add a space between the 9 and the ] in my solution to allow spaces too :) — Tim P, Jun 21 '12 at 02:06
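
To illustrate why the A-hat survives a POSIX-class pattern like the one in this answer but not the ASCII whitelist from the other answer, here is a small sketch. The variable name `a_hat` is illustrative, and the character is written as a Unicode escape on the assumption that it is U+00C2; how character classes treat non-ASCII characters can depend on the locale, so treat this as a hedged demonstration rather than a guarantee.

```r
# The "A-hat" character, written as an escape to avoid source-encoding issues
# (assumed here to be U+00C2).
a_hat <- "\u00c2"

# POSIX classes: the A-hat counts as a letter, not whitespace or punctuation,
# so this pattern is expected to leave it in place, as in the output above.
posix_result <- gsub("[[:space:][:punct:]]+", "", paste0(a_hat, " | "))
print(posix_result)

# ASCII whitelist: anything outside a-z, A-Z and 0-9 is removed,
# which includes the A-hat, the space and the hyphen.
ascii_result <- gsub("[^a-zA-Z0-9]", "", paste0("K-12 ", a_hat))
print(ascii_result)
# value: "K12"
```

The difference is blacklist versus whitelist: the first pattern removes only what it names, while the second removes everything it does not name.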
