I'm doing a text mining task in R.
Tasks:
1) count sentences
2) identify and save quotes in a vector
Problems:
False full stops such as "..." and the periods in titles such as "Mr." have to be dealt with.
There are definitely quotes in the text body, and they will contain "...". I was thinking of extracting those quotes from the main body and saving them in a vector (there is some further manipulation to be done with them too).
IMPORTANT TO NOTE: My text data is in a Word document. I use readtext("path to .docx file") to load it into R. When I view the text, the quotes appear as plain " rather than \", unlike in the reproducible text.
library(readtext)
path <- "C:/Users/.../"
a <- readtext(paste0(path, "Text.docx"))
title <- a$doc_id
text <- a$text
Reproducible text:
text <- "Mr. and Mrs. Keyboard have two children. Keyboard Jr. and Miss. Keyboard. ...
However, Miss. Keyboard likes being called Miss. K [Miss. Keyboard is a bit of a princess ...]
\"Mom how are you o.k. with being called Mrs. Keyboard? I'll never get it...\". "
# splitting by "."
unlist(strsplit(text, "\\."))
The problem is that it also splits on the false full stops.
Solution I tried:
# getting rid of . in titles
vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
library(gsubfn)
# replacing . in titles
gsubfn("\\S+", setNames(as.list(vec.rep), vec), text)
The problem with this is that it does not replace "[Miss." with "[Miss".
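A minimal sketch of one way around this, assuming the cause is that `\\S+` matches the whole token "[Miss.", which is not one of the names in the lookup list. It uses base R `gsub` with a word boundary instead of whole-token lookup. The variable names `titles`, `pattern`, `x` and `cleaned` are only illustrative, and the sample string is a shortened version of the reproducible text.

```r
# Titles from the question, without their trailing period.
titles <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")

# \\b anchors at a word boundary, so "[Miss." matches as well as " Miss.".
pattern <- paste0("\\b(", paste(titles, collapse = "|"), ")\\.")

x <- "Mr. and Mrs. Keyboard [Miss. Keyboard is a princess ...]"

# Keep the title (capture group 1) and drop the period that follows it.
cleaned <- gsub(pattern, "\\1", x, perl = TRUE)
cleaned
```

The "..." is left untouched because none of its periods directly follows a listed title.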
To identify quotes:
library(stringi)
stri_extract_all_regex(text, '"\\S+"')
but that isn't working either. (It does work with \" in the code below.)
stri_extract_all_regex("some text \"quote\" some other text", '"\\S+"')
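A possible explanation is that `\\S+` cannot match whitespace, so only one-word quotes like the one above are found. The sketch below matches everything up to the next quote character instead. It also assumes, without having confirmed it for this file, that the Word document may contain curly quotes (U+201C and U+201D) rather than straight ones. The names `quote_pattern`, `x` and `found` are illustrative.

```r
library(stringi)

# An opening quote (straight or curly), any run of non-quote characters,
# then a closing quote. Curly quotes are an assumption about the .docx input.
quote_pattern <- '["\u201c][^"\u201c\u201d]*["\u201d]'

x <- "She said \"Mom how are you o.k. with this? I'll never get it...\". Then \u201cfine\u201d."

# stri_extract_all_regex returns a list with one element per input string.
found <- stri_extract_all_regex(x, quote_pattern)[[1]]
found
```

Because the pattern stops at the next quote character, nested quotes would need a different approach.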
The exact expected vector is:
sentences <- c("Mr and Mrs Keyboard have two children. ", "Keyboard Jr and Miss Keyboard.", "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]", "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\"")
I want the sentences separated (so I can count how many sentences are in each paragraph), and the quotes separated as well.
quotes <- "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\""
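For the sentence count, here is a rough sketch of one approach, assuming the periods after titles have already been removed. It protects "..." with a placeholder character, splits on whitespace that follows ".", "?" or "!", then restores the ellipsis. The helper `split_sentences` is hypothetical, and the sketch is not meant to reproduce the expected vector above exactly.

```r
split_sentences <- function(x) {
  # Protect "..." so its periods are not read as sentence ends.
  x <- gsub("...", "\u2026", x, fixed = TRUE)
  # Split on whitespace preceded by ., ? or ! (lookbehind needs perl = TRUE).
  parts <- strsplit(x, "(?<=[.?!])\\s+", perl = TRUE)[[1]]
  # Put the three-dot ellipsis back.
  gsub("\u2026", "...", parts, fixed = TRUE)
}

x <- "Mr Keyboard has two children. She is a bit of a princess ... sometimes. Why?"
result <- split_sentences(x)
length(result)
```

One limitation: an ellipsis that ends a sentence is not treated as a boundary here, so such cases would need an extra rule.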