
I am working on the Allen AI Science Challenge currently up on Kaggle.

The idea behind the challenge is to train a model using the provided training data (a set of eighth-grade-level science questions, each with four answer options and the correct answer identified) along with any additional knowledge sources (Wikipedia, science textbooks, etc.), so that it can answer science questions about as well as an (average?) eighth grader.

I'm thinking of taking a first crack at the problem in R (I'm proficient only in R and C++, and I don't think C++ will be a very useful language to solve this problem in). After exploring the Kaggle forums, I decided to use the topicmodels (tm), RWeka, and Latent Dirichlet Allocation (LDA) packages.

My current approach is to build a text predictor of some sort that, on reading the question posed to it, outputs a string of text. I would then compute the cosine similarity between this output text and each of the four options given in the test set, and predict the option with the highest cosine similarity as the correct answer.
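
To make the scoring step concrete, here is a minimal sketch of what I have in mind, with a made-up generated string and made-up answer options; the score is just the cosine similarity between plain term-frequency vectors:

library(tm)

# Hypothetical helper: score each answer option against a generated text by cosine similarity
score_options <- function(generated_text, options) {
  docs <- VCorpus(VectorSource(c(generated_text, options)))
  dtm <- DocumentTermMatrix(docs, control = list(weighting = weightTf))
  m <- as.matrix(dtm)
  cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
  sapply(2:nrow(m), function(i) cos_sim(m[1, ], m[i, ]))
}

# Made-up example: the option with the highest score would be the predicted answer
sims <- score_options("plants use sunlight to make food during photosynthesis",
                      c("photosynthesis", "respiration", "digestion", "fermentation"))
which.max(sims)  # 1 = A, 2 = B, 3 = C, 4 = D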

I will train the model on the training data, a Wikipedia corpus, and a few science textbooks so that it does not overfit.

I have two questions here:

  1. Does the overall approach make sense?

  2. What would be a good starting point for building this text predictor? Would converting the corpus (training data, Wikipedia, and textbooks) into a Term-Document/Document-Term matrix help? I think forming n-grams over all the sources would help, but I don't know what the next step would be, i.e. how exactly the model would read a question and then output a string of text (of, say, size n).
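
For the n-gram part, this is roughly what I was picturing as a first step: a bigram document-term matrix built with RWeka's tokenizer (the two documents are just placeholders for the real corpus):

library(tm)
library(RWeka)

# Placeholder corpus standing in for the training data / Wikipedia / textbook text
docs <- VCorpus(VectorSource(c("the water cycle includes evaporation and condensation",
                               "photosynthesis converts light energy into chemical energy")))

# Tokenize into bigrams with RWeka, then build the document-term matrix over them
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm_bigrams <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))
inspect(dtm_bigrams)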

I have tried implementing part of the approach: finding the optimum number of topics and performing LDA over the training set. Here's the code:

library(topicmodels)
library(RTextTools)

# Read the cleaned training set; each row holds a question and its four answer options
data <- read.delim("cleanset.txt", header = TRUE)
data$question <- as.character(data$question)
data$answerA <- as.character(data$answerA)
data$answerB <- as.character(data$answerB)
data$answerC <- as.character(data$answerC)
data$answerD <- as.character(data$answerD)

# Build a term-frequency document-term matrix from the questions and answer options
dtm <- create_matrix(cbind(data$question, data$answerA, data$answerB, data$answerC, data$answerD),
                     language = "english", removeNumbers = FALSE,
                     stemWords = TRUE, weighting = tm::weightTf)

# Fit LDA models for k = 2..25 topics and record the log-likelihood of each fit
best.model <- lapply(seq(2, 25, by = 1), function(k) LDA(dtm, k))
best.model.logLik.df <- data.frame(topics = 2:25,
                                   LL = sapply(best.model, function(m) as.numeric(logLik(m))))

# Pick the number of topics with the highest log-likelihood and refit with it
best.k <- best.model.logLik.df$topics[which.max(best.model.logLik.df$LL)]
best.model.logLik.df[which.max(best.model.logLik.df$LL), ]
best.model.lda <- LDA(dtm, best.k)
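
After fitting, I was planning to sanity-check the topics with something like terms(best.model.lda, 10) to see the top ten terms per topic, but I still don't see how to get from the fitted topics to a generated string of answer text.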

Any help will be appreciated!

  • Welcome to SO. As written, your post is too broad as SO tends to be focused on specific coding and programming problems. With adaption, your question _might_ be okay for Cross Validated (i.e., if you ask about a specific data analysis or statistics question). However, I would suggest you try something and when you get stuck, come back with a specific question. – Richard Erickson Oct 22 '15 at 14:12
  • Thanks Richard. I have added a snippet of my code which is based on the approach I wish to discuss here. Hope it's better now. – mscott Oct 22 '15 at 14:41
  • 2
    Unfortunately, simply adding a code snippet does not change the situation. As @RichardErickson stated, this question is still too broad for SO. You would be better off posting this question in the Kaggle forums. – capt-calculator Oct 22 '15 at 15:08

0 Answers