I'm trying to cluster short documents such as the following sentences:
sentences <- c("The color blue neutralizes orange yellow reflections.",
               "Zod stabbed me with blue Kryptonite.",
               "Because blue is your favourite colour.",
               "Red is wrong, blue is right.",
               "You and I are going to yellowstone.",
               "Van Gogh looked for some yellow at sunset.",
               "You ruined my beautiful green dress.",
               "You do not agree.",
               "There's nothing wrong with green.")
In the initialization step of my code, I need to randomly assign the documents to K
clusters according to a Dirichlet-multinomial distribution.
How can I perform this task?
Edit: Thanks to @ags29's comment, I found the following in "Sampling from Dirichlet-Multinomial":
D=9 # number of documents in the corpus; I have 9 sentences in my example
k=2 # number of clusters (e.g. 2)
alpha=runif(D) # value of alpha, here chosen at random
p=rgamma(D,alpha) # pre-simulation of the Dirichlet
x=rmultinom(1,k,p)
Does this correctly assign the documents to clusters? What do you think?
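For comparison, here is a minimal sketch of how I read the task, under a few assumptions that are mine and not taken from the linked post: alpha has one entry per cluster (length K, not length D), the prior is symmetric with alpha = 1, and each document gets exactly one cluster label. The names `K`, `g`, `z` and `sizes` are only illustrative. As far as I can tell, the snippet above instead draws k items across D categories, which may not be what is wanted here.

```r
set.seed(42)                      # only to make the sketch reproducible

D <- 9                            # number of documents in the corpus
K <- 2                            # number of clusters
alpha <- rep(1, K)                # ASSUMPTION: symmetric Dirichlet parameter, one entry per cluster

# Draw cluster proportions p ~ Dirichlet(alpha) by normalising Gamma variates
g <- rgamma(K, shape = alpha, rate = 1)
p <- g / sum(g)

# Assign each document to one cluster, with probabilities p
z <- sample.int(K, size = D, replace = TRUE, prob = p)

# Cluster sizes; with p integrated out, these counts are Dirichlet-multinomial
sizes <- tabulate(z, nbins = K)
```

Here `z[i]` would be the initial cluster of `sentences[i]`, and `sizes` always sums to D. If only the counts per cluster are needed, `rmultinom(1, D, p)` should give the same kind of draw without the per-document labels.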