
I have a large data frame of news articles. In some of the articles, two words are joined by a dot, as in this example: "The government.said it was important to quit.". I am going to do some topic modelling, so I need to separate every single word.

This is the code I have used to separate those words:

    #String example
    test <- c("i need.to separate the words connected by dots. however, I need.to keep having the dots separating sentences")

    #Code to separate the words
    test <- do.call(paste, as.list(strsplit(test, "\\.")[[1]]))

    #This is what I get
    > test
    [1] "i need to separate the words connected by dots  however, I need to keep having the dots separating sentences"

As you can see, this deletes all the dots (periods) in the text. How can I get the following outcome instead?

"i need to separate the words connected by dots. however, I need to keep having the dots separating sentences"

Final note

My data frame contains 17,000 articles, and all the text is in lowercase. The string above is just a small example of the problem I have when trying to separate two words connected by a dot. Additionally, is there any way I can use strsplit on a list?

M--
Jose David

1 Answer


You may use

    test <- c("i need.to separate the words connected by dots. however, I need.to keep having the dots separating sentences. Look at http://google.com for s.0.m.e more details.")
    # Replace each dot that is in between word characters
    gsub("\\b\\.\\b", " ", test, perl=TRUE)
    # Replace each dot that is in between letters
    gsub("(?<=\\p{L})\\.(?=\\p{L})", " ", test, perl=TRUE)
    # Replace each dot that is in between word characters, but not in URLs
    gsub("(?:ht|f)tps?://\\S*(*SKIP)(*F)|\\b\\.\\b", " ", test, perl=TRUE)

See the R demo online.

Output:

    [1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google com for s 0 m e more details."
    [1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google com for s.0.m e more details."
    [1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google.com for s 0 m e more details."
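As a quick check against the exact string from the question, the first pattern should give the outcome the asker wanted: the dots joining two words become spaces, while the sentence-ending dot stays. This is only a minimal sketch; the variable name `fixed` is introduced here for illustration.

```r
# The original string from the question
test <- c("i need.to separate the words connected by dots. however, I need.to keep having the dots separating sentences")

# Replace only the dots that have a word character on both sides
fixed <- gsub("\\b\\.\\b", " ", test, perl = TRUE)

print(fixed)
# [1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences"
```

Unlike the `strsplit` + `paste` approach, nothing is split and re-joined here, so the dot after "dots" is never touched.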

Details

  • \b\.\b - a dot enclosed by word boundaries, i.e. the characters immediately before and after the . must both be word characters (a letter, digit or underscore)
  • (?<=\p{L})\.(?=\p{L}) - a dot that is immediately preceded and immediately followed by a letter ((?<=\p{L}) is a positive lookbehind and (?=\p{L}) is a positive lookahead)
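  • (?:ht|f)tps?://\S*(*SKIP)(*F)|\b\.\b - the first alternative matches http/ftp or https/ftps, then ://, then any 0 or more non-whitespace chars; (*SKIP)(*F) then discards that match and resumes the search from the position the regex was at when it came across the SKIP PCRE verb (the end of the URL), so dots inside URLs are left alone. Everywhere else, the second alternative, \b\.\b, applies.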
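Regarding the data frame and the list mentioned in the question: `gsub` is vectorised over its input, so it can be run on a whole character column in one call, and `lapply` can apply the same call to each element of a list. The sketch below assumes a data frame called `articles` with a character column called `text`, plus a list called `article_list`; all of these names and the sample strings are hypothetical stand-ins for the real data.

```r
# Hypothetical stand-in for the real data frame; the column name `text`
# is an assumption
articles <- data.frame(
  text = c("the government.said it was important to quit.",
           "i need.to separate the words. visit http://google.com today."),
  stringsAsFactors = FALSE
)

# The URL-aware pattern: skip URLs, otherwise match a dot between word chars
pattern <- "(?:ht|f)tps?://\\S*(*SKIP)(*F)|\\b\\.\\b"

# gsub() processes every element of the character vector in one call
articles$text <- gsub(pattern, " ", articles$text, perl = TRUE)

# For a list of character vectors (hypothetical), apply the same call
# to each element with lapply()
article_list <- list(a = "need.to go.", b = c("one.two", "three. four"))
cleaned_list <- lapply(article_list, function(x) gsub(pattern, " ", x, perl = TRUE))

print(articles$text)
print(cleaned_list)
```

Note that `strsplit` itself already accepts a character vector and returns a list with one element per input string, so for a character column no explicit loop should be needed either.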
Wiktor Stribiżew