
I was wondering whether it's possible to split up n-gram features in a document-feature matrix (dfm) so that, e.g., a bigram results in two separate unigrams.

head(dfm, n = 3, nfeature = 4)

docs       in_the great plenary emission_reduction
  10752099      3     1       1                  3
  10165509      8     0       0                  3
  10479890      4     0       0                  1

So, the above dfm would result in something like this:

head(dfm, n = 3, nfeature = 4)

docs       in great plenary emission the reduction
  10752099  3     1       1        3   3         3
  10165509  8     0       0        3   8         3
  10479890  4     0       0        1   4         1

For context: the n-grams in the dfm come from translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German but not in English ("emission reduction").

Thank you in advance!

EDIT: The following can be used as a reproducible example.

library(quanteda)

eg.txt <- c('increase in_the great plenary', 
            'great plenary emission_reduction', 
            'increase in_the emission_reduction emission_increase')
eg.corp <- corpus(eg.txt)
eg.dfm <- dfm(eg.corp)

head(eg.dfm)
uyanik
  • What if you have 2 bigrams containing the same word, say "emission_reduction" and "emission_increase": should the counts for the common word ("emission" in the example) be summed into one column? Disclaimer: not an expert here, so this may not make sense... – digEmAll May 24 '17 at 12:57
  • Yes. Say we have the bigram "emission_reduction" twice and "emission_increase" once in a document: the result should be a total of 3 "emission", 2 "reduction", and 1 "increase". If e.g. "increase" also occurs once as a unigram feature, the sum for "increase" should be 2. – uyanik May 24 '17 at 13:04
  • Unfortunately I don't know the dfm format and I don't know if it works like data.frames... could you post a reproducible sample of the data (e.g. the output of dput(head(dfm)))? – digEmAll May 24 '17 at 13:08
  • Of course! Please find the example in the original question above. Thanks! – uyanik May 24 '17 at 13:23
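
To make the counting rule from the comments concrete, here is a minimal base R sketch for a single document. The `counts` vector is a hypothetical illustration (it is not taken from the example data above): it has "emission_reduction" twice, "emission_increase" once, and the unigram "increase" once.

```r
# Hypothetical feature counts for one document (illustration only)
counts <- c(emission_reduction = 2, emission_increase = 1, increase = 1)

# Split each feature name on "_"; a unigram simply yields one part
parts <- strsplit(names(counts), "_", fixed = TRUE)
unigrams <- unlist(parts)

# Repeat each feature's count once per unigram it was split into
values <- rep(unname(counts), lengths(parts))

# Sum the counts of identical unigrams
result <- tapply(values, unigrams, sum)
result
```

This should give 3 for "emission", 2 for "increase" and 2 for "reduction", which matches the rule described above.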

1 Answer


I don't know if this is the best approach (it might use a lot of RAM, since it converts the sparse dfm to a dense data.frame/matrix), but it should work:

# turn the dfm into a data.frame, then into a transposed matrix (features as rows)
DF <- as.data.frame(eg.dfm)
MX <- t(DF)
# split the current column names by '_'
colsSplit <- strsplit(colnames(DF),'_')
# replicate the rows of the matrix and give them the new split row names
MX <- MX[rep(seq_along(colsSplit), lengths(colsSplit)), , drop = FALSE]
rownames(MX) <- unlist(colsSplit)
# aggregate the matrix rows having the same name and transpose again
MX2 <- t(do.call(rbind,by(MX,rownames(MX),colSums)))
# turn the matrix into a dfm
eg.dfm.res <- as.dfm(MX2)

Result:

> eg.dfm.res
Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
3 x 7 sparse Matrix of class "dfmSparse"
       features
docs    emission great in increase plenary reduction the
  text1        0     1  1        1       1         0   1
  text2        1     1  0        0       1         1   0
  text3        2     0  1        2       0         1   1
digEmAll
  • It seems to work perfectly fine if I add `DF <- as.data.frame(eg.dfm)` in the beginning. Right? – uyanik May 24 '17 at 14:04
  • Great! It's a nice workaround with the data frame, I think. Thanks for your help! – uyanik May 24 '17 at 14:15
  • See https://stackoverflow.com/questions/44538939/split-up-ngrams-in-sparse-document-feature-matrix for a better solution that preserves sparsity. – Ken Benoit Jun 14 '17 at 13:20
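
For what it's worth, one way to avoid the dense data.frame is to multiply the sparse count matrix by a sparse feature-to-unigram indicator matrix. The sketch below is only an illustration under assumptions: it is not the solution from the linked question, it uses the Matrix package directly, and the toy matrix `m` is a hand-built stand-in for the example dfm (the names `m`, `map`, `vocab` and `res` are made up here).

```r
library(Matrix)

# Toy sparse count matrix standing in for the example dfm (documents x features)
m <- Matrix(c(1, 1, 1, 1, 0, 0,
              0, 0, 1, 1, 1, 0,
              1, 1, 0, 0, 1, 1),
            nrow = 3, byrow = TRUE, sparse = TRUE,
            dimnames = list(c("text1", "text2", "text3"),
                            c("increase", "in_the", "great", "plenary",
                              "emission_reduction", "emission_increase")))

# Split the feature names on "_" and collect the unigram vocabulary
parts <- strsplit(colnames(m), "_", fixed = TRUE)
unigrams <- unlist(parts)
vocab <- sort(unique(unigrams))

# Sparse indicator matrix: one row per original feature, one column per unigram.
# Duplicate (i, j) pairs are summed, so a feature like "a_a" would count "a" twice.
map <- sparseMatrix(i = rep(seq_along(parts), lengths(parts)),
                    j = match(unigrams, vocab),
                    x = 1,
                    dims = c(ncol(m), length(vocab)),
                    dimnames = list(colnames(m), vocab))

# Sparse product: documents x unigrams, with counts of shared unigrams summed
res <- m %*% map
res
```

On this toy input the counts agree with the result shown in the answer (e.g. text3 has 2 "emission" and 2 "increase"). With a real dfm, passing the product to `as.dfm()` as in the answer should give a dfm back, but I have not checked that across quanteda versions.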