
I am trying to create a dataset for training a neural network. Previously I used a for loop to concatenate the pieces into sentences, but because that took too long I reimplemented the sentence generation with foreach. The process is now fast and completes in under 50 seconds. I am only doing slot filling on a template, which is then pasted together to form a sentence, but the output is garbled: misspelled words, stray spaces between words, missing words, and so on.

library(foreach)
library(doParallel)
library(tictoc)

tic("Data preparation - parallel mode")
cl <- makeCluster(3)
registerDoParallel(cl)

f_sentences<-c();sentences<-c()
hr=38:180;fl=1:5;month=1:5
strt<-Sys.time()
a<-foreach(hr=38:180,.packages = c('foreach','doParallel')) %dopar% {
  foreach(fl=1:5,.packages = c('foreach','doParallel')) %dopar%{
    foreach(month=1:5,.packages = c('foreach','doParallel')) %dopar% {
      if(hr>=35 & hr<=44){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being severly_low).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      if(hr>=45 & hr<=59){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being low).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      if(hr>=60 & hr<=100){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being medium).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      if(hr>=101 & hr<=150){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being high).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      if(hr>=151 & hr<=180){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being severly_high).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      return(outfile)
    }
    write.table(outfile,file="/home/outfile.txt",append = T,row.names = F,col.names = F)
    gc()
  }
}
stopCluster(cl)
toc()

Statistics for the generated file:

  • Number of lines: 427,975
  • Splitting used : word split (" ")
  • Vocabulary: 567

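    library(data.table)  # provides fread()
    library(magrittr)    # provides %>%

    path <- "/home/outfile.txt"
    splitting <- " "  # word split on a single space, as noted above
    File <- fread(path, sep = "\n", header = FALSE)[[1]]
    corpus <- tolower(File) %>%
      # removePunctuation() %>%
      strsplit(splitting) %>%
      unlist()
    vocab <- unique(corpus)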

    A sentence this simple should have a very small vocabulary, since the numbers are the only parts that change. On checking the vocab output and using grep, I found many garbled words (and some missing words), such as wentt and crpply, which should not appear given that the template is fixed.

    Expected sentence
    "About 40 soldiers died in the battle (count being severly_low). Around 1 soldiers and civilians went missing. We only have about 146 crates which lasts for 1 months as food supply"

    grep -rnw 'outfile.txt' -e 'wentt'
    24105:"About 62 soldiers died in the battle (count being medium). Around 2 soldiers and civilians wentt 117 crates which lasts for 1 months as food supply"

    grep -rnw 'outfile.txt' -e 'crpply'
    76450:"About 73 soldiers died in the battle (count being medium). Around 1 soldiers and civilians went missing. We only have about 133 crpply"
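
The same check can be done in R by matching every line against the fixed template. The following is only a minimal sketch: the regular expression `template` and the vector `lines` are illustrative names, and the three sample lines are copied from the examples above. Lines read back from the file would first need their surrounding quotes stripped, for example with `gsub('^"|"$', "", lines)`.

```r
# Regular expression for the fixed template; only the numbers and the
# count label are allowed to vary.
template <- paste0(
  "^About [0-9]+ soldiers died in the battle ",
  "\\(count being (severly_low|low|medium|high|severly_high)\\)\\. ",
  "Around [0-9]+ soldiers and civilians went missing\\. ",
  "We only have about [0-9]+ crates which lasts for [0-9]+ months as food supply$"
)

# Sample lines taken from the question (one expected, two garbled).
lines <- c(
  "About 40 soldiers died in the battle (count being severly_low). Around 1 soldiers and civilians went missing. We only have about 146 crates which lasts for 1 months as food supply",
  "About 62 soldiers died in the battle (count being medium). Around 2 soldiers and civilians wentt 117 crates which lasts for 1 months as food supply",
  "About 73 soldiers died in the battle (count being medium). Around 1 soldiers and civilians went missing. We only have about 133 crpply"
)

# Indices of the lines that do not follow the template.
bad <- which(!grepl(template, lines))
print(lines[bad])
```

Counting `length(bad)` over the whole file gives a direct measure of how many sentences were damaged, independent of the vocabulary size.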

    The first few sentences are generated correctly; after that the problem occurs. What is the reason for this? I am just performing a normal paste with slot filling. Any help would be appreciated!

Sooraj

1 Answer


The code is running correctly now and no longer produces errors. I assume the earlier problem was a one-off glitch. I tested it on other machines with different R versions and still could not reproduce the problem.
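
Although the problem did not reappear, one thing that may be worth ruling out is that several workers append to the same file at the same time. This is an assumption, not something confirmed in this thread. The sketch below avoids any concurrent writes by having each worker return its sentences and writing the file once from the main process. The helper `sentences_for`, the cluster size of 2 and the temporary output path are illustrative choices, not part of the original code.

```r
library(foreach)
library(doParallel)

# Hypothetical helper: builds the 25 sentences (5 fl x 5 month) for one hr.
sentences_for <- function(hr, fl = 1:5, month = 1:5) {
  # Same ranges and labels as the if-blocks in the question.
  label <- as.character(cut(
    hr,
    breaks = c(34, 44, 59, 100, 150, 180),
    labels = c("severly_low", "low", "medium", "high", "severly_high")
  ))
  grid <- expand.grid(month = month, fl = fl)
  crates <- sample(38:180, nrow(grid), replace = TRUE)
  paste("About", hr, "soldiers died in the battle",
        paste0("(count being ", label, ")."),
        "Around", grid$fl, "soldiers and civilians went missing.",
        "We only have about", crates,
        "crates which lasts for", grid$month, "months as food supply")
}

cl <- makeCluster(2)  # assumed worker count
registerDoParallel(cl)

# Only the outer loop is parallel; workers return character vectors
# and never touch the output file.
sentences <- foreach(hr = 38:180, .combine = c) %dopar% {
  sentences_for(hr)
}

stopCluster(cl)

# A single write from the main process, so lines cannot interleave.
out_path <- file.path(tempdir(), "outfile.txt")
writeLines(sentences, out_path)
```

Returning values through `.combine` also removes the need for the nested `%dopar%` loops and for growing `f_sentences` inside the workers.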

Sooraj