How to remove words not in caps in R?

Question

I'm doing text analysis using R. Is there a way to remove all the words not in caps using tm or stringi?

If I have something like this

Albert Einstein went to the store and saw his friend Nikola Tesla ... + 200 pages

to be converted into

Albert Einstein Nikola Tesla

Best regards

David Arenburg · Answer 1 · 2016-05-03T20:17:39.023

score 8

You could just remove those words using a simple regex:

x <- "Albert Einstein went to the store and saw his friend Nikola Tesla"
gsub("\\b[a-z]+\\s+", "", x)
# [1] "Albert Einstein Nikola Tesla"

The pattern matches a word boundary, then a run of lowercase letters, then the whitespace that follows, and gsub replaces each match with an empty string.
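One thing to watch with this pattern: `\\s+` requires whitespace after the word, so a lowercase word at the very end of the string is left in place. A small sketch, using an input string made up for illustration (`y` is not from the question):

```r
# Hypothetical input: the last word is lowercase and has no trailing whitespace
y <- "Albert Einstein went home"

# "went " is removed, but "home" has nothing after it for \\s+ to match
res_end <- gsub("\\b[a-z]+\\s+", "", y)
res_end
# [1] "Albert Einstein home"
```

If the text can end in a lowercase word, the pattern needs to treat the trailing whitespace as optional.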

Though if you have words such as don't, you would need a slightly more complicated regex, something like:

x <- "if Albert Einstein didn't see his friend Nikola Tesla leavin'"
gsub("\\b[a-z][^ ]*(\\s+)?", "", x)
# [1] "Albert Einstein Nikola Tesla "
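The trailing space in that result can be cleaned up in a second step. A minimal sketch, assuming base R's `trimws()` is available (it was added in R 3.2.0); the variable name `res` is only for illustration:

```r
x <- "if Albert Einstein didn't see his friend Nikola Tesla leavin'"

# Drop every token that starts with a lowercase letter, plus any whitespace after it
res <- gsub("\\b[a-z][^ ]*(\\s+)?", "", x)

# Strip the whitespace left at either end of the string
res <- trimws(res)
res
# [1] "Albert Einstein Nikola Tesla"
```

This keeps the regex itself simple and pays for it with one extra pass over the string.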

edited May 03 '16 at 20:17

answered May 03 '16 at 19:58

David Arenburg


`(\\s+)?` is also known as `\\s*` :) the extra space is annoying - not obvious how to fix in a single regex (without making a complete mess out of it) – eddi May 03 '16 at 21:43
@eddi, yeah thanks, though I'm not sure it's any better. Regarding the extra space: it will only happen in that certain case when there is some punctuation at the end of the whole document, but I guess it's not such a big loss for the OP, eh? – David Arenburg May 03 '16 at 21:55

score 6 · Accepted Answer · answered May 03 '16 at 19:56

Just use grep and a regular expression:

words <- 'Albert Einstein went to the store and saw his friend Nikola Tesla'

# split to vector of individual words
vec <- unlist(strsplit(words, ' '))
# just the capitalized ones
caps <- grep('^[A-Z]', vec, value = TRUE)
# assemble back to a single string, if you want
paste(caps, collapse=' ')
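For a long document it may be handier to wrap these steps in a function. The sketch below is one possible way to do that, not part of the answer: `keep_capitalized` is a hypothetical name, it splits on runs of whitespace instead of a single space, and it swaps `[A-Z]` for `[[:upper:]]` on the assumption that a locale-independent class is preferable to a character range.

```r
# Hypothetical helper: keep only the words that start with an uppercase letter
keep_capitalized <- function(text) {
  # split on any run of whitespace (spaces, tabs, newlines)
  vec <- unlist(strsplit(text, "\\s+"))
  # keep the tokens whose first character is uppercase
  caps <- grep("^[[:upper:]]", vec, value = TRUE)
  # glue the survivors back into a single string
  paste(caps, collapse = " ")
}

out <- keep_capitalized("Albert Einstein went to the store and saw his friend Nikola Tesla")
out
# [1] "Albert Einstein Nikola Tesla"
```

Because the result is rebuilt with `paste`, there is no leftover whitespace to trim.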
