3

I'm doing text analysis using R. Is there a way to remove every word that does not start with a capital letter, using tm or stringi?

If I have something like this

Albert Einstein went to the store and saw his friend Nikola Tesla ... + 200 pages

to be converted into

Albert Einstein Nikola Tesla

Best regards

pachadotdev

2 Answers

8

You could just remove those words using a simple regex

x <- "Albert Einstein went to the store and saw his friend Nikola Tesla"
gsub("\\b[a-z]+\\s+", "", x)
# [1] "Albert Einstein Nikola Tesla"

This matches a word boundary, then one or more lowercase letters, then all the whitespace after them, and removes the whole match.


If the text contains words such as don't, though, you need a slightly more complicated regex. Something like

x <- "if Albert Einstein didn't see his friend Nikola Tesla leavin'"
gsub("\\b[a-z][^ ]*(\\s+)?", "", x)
# [1] "Albert Einstein Nikola Tesla "
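
The trailing space in that result can be dropped in a second step. A minimal sketch, assuming base R's trimws() (available since R 3.2.0) is acceptable and that a single-regex solution is not required; the name cleaned is only illustrative:

x <- "if Albert Einstein didn't see his friend Nikola Tesla leavin'"
# Same pattern as above, with (\\s+)? written as the equivalent \\s*;
# trimws() then strips any leading or trailing whitespace left behind.
cleaned <- trimws(gsub("\\b[a-z][^ ]*\\s*", "", x))
cleaned
# [1] "Albert Einstein Nikola Tesla"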
David Arenburg
  • `(\\s+)?` is also known as `\\s*` :) the extra space is annoying - not obvious how to fix in a single regex (without making a complete mess out of it) – eddi May 03 '16 at 21:43
  • @eddi, yeah, thanks, though I'm not sure it's any better. Regarding the extra space: it will only happen in that particular case, when there is some punctuation at the end of the whole document, but I guess it's not such a big loss for the OP, eh? – David Arenburg May 03 '16 at 21:55
6

Just use grep and a regular expression:

words <- 'Albert Einstein went to the store and saw his friend Nikola Tesla'

# split to vector of individual words
vec <- unlist(strsplit(words, ' '))
# just the capitalized ones
caps <- grep('^[A-Z]', vec, value = TRUE)
# assemble back to a single string, if you want
paste(caps, collapse=' ')
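
For a long document, the text will probably contain line breaks or repeated spaces rather than single spaces only. A possible variant of the same split-and-filter idea, assuming whitespace is the only word separator; the sample string and the name result are made up for illustration:

# hypothetical input with a double space and a newline
words <- "Albert Einstein went  to the store\nand saw his friend Nikola Tesla"
# split on any run of whitespace instead of a single space
vec <- strsplit(words, "\\s+")[[1]]
# keep only the tokens that start with an uppercase letter
caps <- vec[grepl("^[A-Z]", vec)]
result <- paste(caps, collapse = " ")
result
# [1] "Albert Einstein Nikola Tesla"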
arvi1000