
I analyze brand mentions in text to compute KPIs such as ad recognition. However, brands that contain special characters are destroyed by my code so far.

library(qdap)
library(stringr)
test <- c("H&M", "C&A", "Zalando", "Zalando", "Amazon", "Sportscheck")

wfm(test)

This is the output:

            all
a             1
amazon        1
c             1
h             1
m             1
sportscheck   1
zalando       2

Is there a package or method to achieve that "H&M" is counted as "h&m", rather than as a separate "h" and "m" as if it were two brands?

Edit: the wfm function has a ... argument which SHOULD allow me to use the strip function.

wfm(test, ... = strip(test, char.keep = "&"))

Unfortunately, this does not work.

  • A decent text example would be helpful. Your test object could be split by `strsplit` and then counted, but that is probably not what you are looking for. Most text mining tools would remove the special chars, so depending on your text there might be some functions that can help to preserve them. – phiver Nov 18 '18 at 11:25
  • Take this text as the answer of one person. My data is a data frame (or rather a character vector) with 70,000 rows for 70,000 persons. –  Nov 18 '18 at 11:32
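A base-R sketch of the counting idea from the first comment above: since the example vector already holds one brand per element, no tokenizer is needed at all, and `table()` leaves special characters untouched (the exact print layout of the result will vary).

```r
# Count brand frequencies without any text-mining cleanup:
# table() treats each element as one token, so "h&m" stays intact.
test <- c("H&M", "C&A", "Zalando", "Zalando", "Amazon", "Sportscheck")
counts <- table(tolower(test))
counts
```

This only works because the brands are already separated; for free-running text you would first need a split that respects "&", which is what the answers below address.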

2 Answers


I am not familiar with the qdap package, but substituting & might solve your problem:

replacement <- ""  # set your replacement, e.g. "" (empty string) or "_"
test <- gsub("&", replacement, test, fixed = TRUE)
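Building on that substitution idea, here is a sketch that keeps the ampersand recoverable: replace it with a letters-only placeholder before wfm(), then restore it in the row names afterwards. The placeholder name AMPERSAND is an assumption; this only works if no brand contains that string, and it relies on wfm() lowercasing but otherwise keeping purely alphabetic tokens intact.

```r
library(qdap)  # for wfm()

test <- c("H&M", "C&A", "Zalando", "Zalando", "Amazon", "Sportscheck")

# Swap "&" for a placeholder that survives wfm()'s cleaning...
safe <- gsub("&", "AMPERSAND", test, fixed = TRUE)
m <- wfm(safe)

# ...then map it back in the term names ("hampersandm" -> "h&m").
rownames(m) <- gsub("ampersand", "&", rownames(m), fixed = TRUE)
m
```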
niko

I would do something like this. The udpipe package has a function document_term_frequencies in which you can specify the split; it turns the data into a data frame with frequency counts. If there is no id column to specify, it will generate one. The object returned by document_term_frequencies is a data.table.

library(udpipe)

# data.frame without a ID column
my_data <- data.frame(text = c("H&M, C&A, Zalando, Zalando, Amazon, Sportscheck", 
                               "H&M, C&A, Amazon, Sportscheck"),
                      stringsAsFactors = FALSE)

# if you have an ID column add document = my_data$id to the function
# see more examples in ?document_term_frequencies
document_term_frequencies(my_data$text, split = ",")

   doc_id         term freq
1:   doc1          H&M    1
2:   doc1          C&A    1
3:   doc1      Zalando    2
4:   doc1       Amazon    1
5:   doc1  Sportscheck    1
6:   doc2          H&M    1
7:   doc2          C&A    1
8:   doc2       Amazon    1
9:   doc2  Sportscheck    1
phiver
  • Not bad, but I got duplicate values for each doc :-( –  Nov 18 '18 at 12:21
  • That sounds like a spelling or capitalization issue. What is the result if you first apply tolower to the text? Otherwise you need to provide more examples, or a dput(head(your data, 20)) to share the first 20 rows. – phiver Nov 18 '18 at 12:43
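A base-R sketch of the normalization suggested in the last comment: lowercase the text and trim whitespace around each split term before counting, so variants like "H&M" vs. "h&m" or " Amazon" vs. "Amazon" collapse into one term per document (the sample strings here are illustrative).

```r
text <- c("H&M, C&A, Zalando, Zalando, Amazon, Sportscheck",
          "h&m, C&A, Amazon , Sportscheck")

# Split on commas, then normalize case and surrounding whitespace,
# so near-duplicate spellings count as the same term.
terms <- lapply(strsplit(text, ","), function(x) trimws(tolower(x)))
lapply(terms, table)
```

If duplicates persist after this, the remaining differences are likely genuine spelling variants that need an explicit mapping table.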