
I am trying to apply a character n-gram model to a string in order to compute the string's probability under that model.

I created a character bigram model with stringdist::qgrams():

library(tidyverse)
library(stringdist)

ref_corpus   <- c("This is a sample sentence", "Other sentences from the reference corpus", "Many other ones")
bigram_ref   <- qgrams(ref_corpus, q = 2)       # collecting all bigrams
bigram_model <- log(bigram_ref/sum(bigram_ref)) # computing the log probability of each bigram

bigram_model
#           Th        hi        is        s         sa        se        te        th
# V1 -4.356709 -4.356709 -3.663562 -3.258097 -4.356709 -3.663562 -3.663562 -3.258097

Now, I want to use this model to compute the probability of a new string within the model:

bigram_string <- qgrams("This one", q = 2) 
bigram_string
#    Th hi is s  on ne  o
# V1  1  1  1  1  1  1  1

I can't figure out how to multiply these two named matrices so that each count in bigram_string is multiplied by the log probability of the same bigram in bigram_model.

Expected output:

bigram_string %*% bigram_model
#            Th        hi        is         s  ...
# V1  -4.356709 -4.356709 -3.663562 -3.258097  ...

# Actual output:
# Error in bigram_string %*% bigram_model : non-conformable arguments

I made some progress with subsetting:

bigram_model["V1",][bigram_string]

# But output:
#        Th        Th        Th        Th        Th        Th        Th 
# -4.356709 -4.356709 -4.356709 -4.356709 -4.356709 -4.356709 -4.356709
iNyar

1 Answer


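Perhaps we need to subset bigram_model by the column names of bigram_string, so that both objects hold the same bigrams in the same order before multiplying: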

bigram_model[, colnames(bigram_string)] * bigram_string

Output:

          Th        hi        is        s         on        ne         o
V1 -4.356709 -4.356709 -3.663562 -3.258097 -4.356709 -4.356709 -3.663562
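
If the goal is a single log probability for the whole string, one option is to sum the count-weighted log probabilities. Note that subsetting by column name as above stops with a "subscript out of bounds" error as soon as the string contains a bigram that never occurs in the reference corpus. The sketch below aligns the two objects with match() instead. The helper string_logprob() and its unseen argument are hypothetical names for illustration, not part of stringdist, and treating unseen bigrams as log(0) = -Inf is only an assumption (a smoothed floor value may be preferable).

```r
library(stringdist)

ref_corpus <- c("This is a sample sentence",
                "Other sentences from the reference corpus",
                "Many other ones")

# 1 x K matrix of log relative frequencies, built as in the question
bigram_ref   <- qgrams(ref_corpus, q = 2)
bigram_model <- log(bigram_ref / sum(bigram_ref))

# Hypothetical helper: sum of count * log probability over the bigrams
# of `string`, aligned to the model by bigram name.
# `unseen` is the (assumed) log probability given to bigrams absent from the model.
string_logprob <- function(string, model, unseen = -Inf) {
  counts <- qgrams(string, q = 2)
  idx    <- match(colnames(counts), colnames(model))  # NA for unseen bigrams
  logp   <- as.vector(model)[idx]
  logp[is.na(logp)] <- unseen
  sum(as.vector(counts) * logp)
}

string_logprob("This one", bigram_model)                     # all bigrams are in the model
string_logprob("xq", bigram_model)                           # unseen bigram, gives -Inf
string_logprob("xq", bigram_model, unseen = log(1e-6))       # finite, with an assumed floor
```

With unseen = -Inf a single unknown bigram makes the whole string impossible under the model, which is why a floor or some form of smoothing is often used in practice.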
akrun