R text mining - remove special characters and quotes

Question

I'm doing a text mining task in R.

Tasks:

1) count sentences

2) identify and save quotes in a vector

Problems :

False full stops like "..." and periods in titles like "Mr." have to be dealt with.

There are definitely quotes in the text body, and there will be "..." inside them. I was thinking of extracting those quotes from the main body and saving them in a vector (there's some manipulation to be done with them too).

IMPORTANT TO NOTE: My text data is in a Word document. I use readtext("path to .docx file") to load it into R. When I view the text, quotes appear as " rather than \", unlike in the reproducible text.

library(readtext)
path <- "C:/Users/.../"
a <- readtext(paste0(path, "Text.docx"))
title <- a$doc_id
text <- a$text

reproducible text

text <- "Mr. and Mrs. Keyboard have two children. Keyboard Jr. and Miss. Keyboard. ... 
However, Miss. Keyboard likes being called Miss. K [Miss. Keyboard is a bit of a princess ...]
 \"Mom how are you o.k. with being called Mrs. Keyboard? I'll never get it...\". "


#  splitting by "." 
unlist(strsplit(text, "\\."))

The problem is that it also splits on false full stops. Solution I tried:

# getting rid of . in titles 
vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")

library(gsubfn)
# replacing . in titles
gsubfn("\\S+", setNames(as.list(vec.rep), vec), text)

The problem with this is that it's not replacing [Miss. with [Miss

To identify quotes :

library(stringi)
stri_extract_all_regex(text, '"\\S+"')

but that's not working either. (It does work with \" in the code below.)

stri_extract_all_regex("some text \"quote\" some other text", '"\\S+"')
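A note on the " versus \" difference: in R source code the backslash only escapes the quote inside a string literal, and the stored string holds a plain ". print() displays the escaped form, while writeLines() displays the raw characters. A minimal base R sketch (the variable names are illustrative):

```r
x <- "some text \"quote\" some other text"

# The escape sequence \" is a single character: the double quote itself.
n_chars <- nchar("\"")

# No backslash is actually stored in the string.
has_backslash <- grepl("\\", x, fixed = TRUE)

print(x)       # displays the escaped form, with \" around quote
writeLines(x)  # displays the raw characters, with a plain " around quote
```

So a string read from a .docx file and a literal typed with \" can hold exactly the same characters. If the Word file uses typographic (curly) quotes instead, which is only an assumption about the file, a pattern written with a straight " would not match them.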

The exact expected vector is :

sentences <- c("Mr and Mrs Keyboard have two children. ", "Keyboard Jr and Miss Keyboard.", "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]", "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\"")

I want the sentences separated (so I can count how many sentences are in each paragraph), with the quotes separated as well.

quotes <- "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\""

It is weird you have a full stop after `Miss` as it is not an abbreviation. Even if you remove a dot with `text <- gsub("Miss.", "Miss", text, fixed=TRUE)`, I cannot leverage the `tm` / `OpenNLP` package as it parses out a `[4] "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]\n \"Mom how are you o.k. with being called Mrs. Keyboard?"` sentence. What are your sentence separation rules? Should any text in double quotation mark be extracted as is, even if there are multiple sentences inside? — Wiktor Stribiżew, Oct 23 '18 at 07:50
However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...] is a sentence and "Mom how are you o.k. with being called Mrs. Keyboard?" is another since they are separated by a \n — Yeshyyy, Oct 23 '18 at 09:23
expected result is extracting sentences despite the false periods. And extracting whole quotes — Yeshyyy, Oct 23 '18 at 09:25
Please update the question itself. Add the exact expected character vector. — Wiktor Stribiżew, Oct 23 '18 at 09:29
Ok, you may match all your current `vec` values using `gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)`. Note this won't handle `o.k.`. You might use another approach for that. But splitting into sentences does not seem clear. — Wiktor Stribiżew, Oct 23 '18 at 10:30
@Wiktor Stribiżew The above code worked great with dealing with false periods thank you! Would you have a solution for the second part of the question : extracting whole quotes from a text and saving them in a vector? — Yeshyyy, Oct 24 '18 at 11:54
If you just want to extract quotes, try `regmatches(text, gregexpr('"[^"]*"', text))` — Wiktor Stribiżew, Oct 24 '18 at 11:55
Perfect Wiktor, thank you! If you want to post an answer, I'll accept it. Otherwise I can summarize and post your answers, up to you. Cheers. — Yeshyyy, Oct 24 '18 at 12:01
I have tried something like `regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+(?:\\s|[?!.])+[^"[:alnum:]]*', trimws(text)))`. Is it any better? — Wiktor Stribiżew, Oct 24 '18 at 12:04

score 1 · Accepted Answer · answered Oct 24 '18 at 12:07

You may match all your current vec values using

gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)

That is, \w+ matches 1 or more word chars and \. matches a dot.
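For comparison, a similar effect can be sketched in base R without gsubfn. This is an alternative, not the approach used above: it assumes the titles can be written as one alternation, and it uses perl = TRUE so that \\b is handled by PCRE. The sample string and the names titles, pattern and cleaned are illustrative.

```r
text <- "Mr. and Mrs. Keyboard like Miss. K [Miss. Keyboard is o.k.]"

titles  <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
# Build \b(Mr|Mrs|...)\. : a known title followed by a literal dot.
pattern <- paste0("\\b(", paste(titles, collapse = "|"), ")\\.")

# Keep the title (group 1) and drop the dot that follows it.
cleaned <- gsub(pattern, "\\1", text, perl = TRUE)
print(cleaned)
```

Only dots that directly follow a listed title are removed, so o.k. and real sentence-ending periods stay as they are.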

Next, if you just want to extract quotes, use

regmatches(text, gregexpr('"[^"]*"', text))

The " matches a " and [^"]* matches 0 or more chars other than ".
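A small base R sketch of how this pattern differs from the "\\S+" attempt in the question; the sample string x is made up for illustration. Because \\S cannot match a space, the earlier pattern only finds quotes that contain no whitespace.

```r
x <- "She said \"hello there\" and then \"bye\"."

# [^"]* may span spaces, so multi-word quotes are captured whole.
quotes <- regmatches(x, gregexpr('"[^"]*"', x))[[1]]

# \S+ cannot cross a space, so only the one-word quote is found.
one_word <- regmatches(x, gregexpr('"\\S+"', x))[[1]]

print(quotes)
print(one_word)
```

Here quotes holds both quoted passages, while one_word holds only the single-word one.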

If you plan to match your sentences together with quotes, you might consider

regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))

Details

\\s* - 0+ whitespaces
"[^"]*" - a ", 0+ chars other than " and a "
| - or
[^"?!.]+ - 1 or more chars other than ?, ", ! and .
[[:space:]?!.]+ - 1 or more whitespace, ?, ! or . chars
[^"[:alnum:]]* - 0+ chars other than " and alphanumeric chars
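The breakdown above can be tried on a short input. This is a minimal sketch with a made-up string x; it applies the same pattern and then trims each piece, since matches keep the whitespace around them.

```r
x <- "It rained. \"Stay in!\" We stayed."

pattern <- '\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*'
pieces  <- regmatches(x, gregexpr(pattern, x))[[1]]

# Matches keep surrounding whitespace, so trim before counting.
pieces <- trimws(pieces)
print(pieces)
print(length(pieces))
```

This yields three pieces, with the quoted part as its own element. Note that a quote embedded in the middle of a sentence is also split out as a separate piece, so the count reflects pieces rather than grammatical sentences.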

R sample code:

> vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
> vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
> library(gsubfn)
> text <- gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)
> regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))
[[1]]
[1] "Mr and Mrs Keyboard have two children. "                                                       
[2] "Keyboard Jr and Miss Keyboard. ... \n"                                                         
[3] "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]\n "
[4] "\"Mom how are you o.k. with being called Mrs Keyboard? I'll never get it...\""
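From that result, the two tasks in the question (counting sentences and saving quotes in a vector) could be finished roughly as follows. This is a sketch that assumes each element counts as one sentence and that a quote is any element starting with a double quote; the names pieces, n_sentences and quotes are illustrative.

```r
# The four elements printed above, copied as a character vector.
pieces <- c(
  "Mr and Mrs Keyboard have two children. ",
  "Keyboard Jr and Miss Keyboard. ... \n",
  "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]\n ",
  "\"Mom how are you o.k. with being called Mrs Keyboard? I'll never get it...\""
)

pieces <- trimws(pieces)        # drop leading/trailing spaces and newlines
n_sentences <- length(pieces)   # task 1: count sentences

# Task 2: keep only the pieces that start with a double quote.
quotes <- pieces[startsWith(pieces, "\"")]

print(n_sentences)
print(quotes)
```

The stray "..." after the second sentence is still there after trimming; removing it would need a separate rule.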
