I am trying to create a dataset for training a neural network. Previously I used a for loop to concatenate strings into sentences, but because that took too long I switched to generating the sentences with foreach. The parallel version is fast and finishes in under 50 seconds. I am only doing slot filling on a template, pasting the pieces together to form a sentence, but the output is garbled: misspelled words, stray spaces between words, words missing altogether, and so on.
library(foreach)
library(doParallel)
library(tictoc)
tic("Data preparation - parallel mode")
cl <- makeCluster(3)
registerDoParallel(cl)
f_sentences <- c(); sentences <- c()
hr = 38:180; fl = 1:5; month = 1:5
strt <- Sys.time()
a <- foreach(hr = 38:180, .packages = c('foreach', 'doParallel')) %dopar% {
  foreach(fl = 1:5, .packages = c('foreach', 'doParallel')) %dopar% {
    foreach(month = 1:5, .packages = c('foreach', 'doParallel')) %dopar% {
      if (hr >= 35 & hr <= 44) {
        sentences <- paste("About", toString(hr), "soldiers died in the battle (count being severly_low).", "Around", toString(fl),
                           "soldiers and civilians went missing. We only have about", (sample(38:180, 1)), "crates which lasts for", toString(month), "months as food supply")
        f_sentences <- c(f_sentences, sentences); outfile <- unname(f_sentences)
      }
      if (hr >= 45 & hr <= 59) {
        sentences <- paste("About", toString(hr), "soldiers died in the battle (count being low).", "Around", toString(fl),
                           "soldiers and civilians went missing. We only have about", (sample(38:180, 1)), "crates which lasts for", toString(month), "months as food supply")
        f_sentences <- c(f_sentences, sentences); outfile <- unname(f_sentences)
      }
      if (hr >= 60 & hr <= 100) {
        sentences <- paste("About", toString(hr), "soldiers died in the battle (count being medium).", "Around", toString(fl),
                           "soldiers and civilians went missing. We only have about", (sample(38:180, 1)), "crates which lasts for", toString(month), "months as food supply")
        f_sentences <- c(f_sentences, sentences); outfile <- unname(f_sentences)
      }
      if (hr >= 101 & hr <= 150) {
        sentences <- paste("About", toString(hr), "soldiers died in the battle (count being high).", "Around", toString(fl),
                           "soldiers and civilians went missing. We only have about", (sample(38:180, 1)), "crates which lasts for", toString(month), "months as food supply")
        f_sentences <- c(f_sentences, sentences); outfile <- unname(f_sentences)
      }
      if (hr >= 151 & hr <= 180) {
        sentences <- paste("About", toString(hr), "soldiers died in the battle (count being severly_high).", "Around", toString(fl),
                           "soldiers and civilians went missing. We only have about", (sample(38:180, 1)), "crates which lasts for", toString(month), "months as food supply")
        f_sentences <- c(f_sentences, sentences); outfile <- unname(f_sentences)
      }
      return(outfile)
    }
    write.table(outfile, file = "/home/outfile.txt", append = T, row.names = F, col.names = F)
    gc()
  }
}
stopCluster(cl)
toc()
Stats of the resulting file:
- Number of lines: 427,975
- Splitting used: word split (" ")
- Vocabulary size: 567
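library(data.table)  # provides fread
library(magrittr)    # provides %>%
path<-"/home/outfile.txt"
splitting<-" "  # word split on a single space
File<-(fread(path,sep = "\n",header = F))[[1]]
corpus<-tolower(File) %>%
#removePunctuation() %>%
strsplit(splitting) %>%
unlist()
vocab<-unique(corpus)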
A simple sentence like this should have a very small vocabulary, since the numbers are the only parts that change. On checking the vocab output and searching the file with grep, I found a lot of garbled words such as wentt and crpply (and some words missing altogether), which should not be possible with a fixed template.
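To make the expected vocabulary concrete, the template can be filled sequentially in memory and tokenised the same way. This is a minimal sketch, not the code that produced the file: `count_label` and `make_sentence` are hypothetical helper names introduced here, and the sentences never pass through `write.table`, so the quote characters it adds are absent.

```r
# Hypothetical helper: map a death count to the label used in the template.
count_label <- function(hr) {
  as.character(cut(hr,
                   breaks = c(34, 44, 59, 100, 150, 180),
                   labels = c("severly_low", "low", "medium", "high", "severly_high")))
}

# Hypothetical helper: fill the three slots of the template.
make_sentence <- function(hr, fl, month) {
  paste("About", hr, "soldiers died in the battle",
        paste0("(count being ", count_label(hr), ")."),
        "Around", fl, "soldiers and civilians went missing.",
        "We only have about", sample(38:180, 1),
        "crates which lasts for", month, "months as food supply")
}

# Every (hr, fl, month) combination, generated without any parallelism.
grid <- expand.grid(month = 1:5, fl = 1:5, hr = 38:180)
sentences <- mapply(make_sentence, grid$hr, grid$fl, grid$month)

# Same tokenisation as in the question: lowercase, split on a single space.
vocab <- unique(unlist(strsplit(tolower(sentences), " ", fixed = TRUE)))
length(vocab)
```

With this template the count is 177 (24 fixed words, 5 count labels, the 143 numbers from 38 to 180 and the numbers 1 to 5), well below the 567 tokens found in the file.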
Expected sentence:
"About 40 soldiers died in the battle (count being severly_low). Around 1 soldiers and civilians went missing. We only have about 146 crates which lasts for 1 months as food supply"
Garbled sentences found with grep:
grep -rnw 'outfile.txt' -e 'wentt'
24105:"About 62 soldiers died in the battle (count being medium). Around 2 soldiers and civilians wentt 117 crates which lasts for 1 months as food supply"
grep -rnw 'outfile.txt' -e 'crpply'
76450:"About 73 soldiers died in the battle (count being medium). Around 1 soldiers and civilians went missing. We only have about 133 crpply"
The first few sentences are generated correctly; the problem only appears after that. What is the reason for this? I am just performing a normal paste with slot filling. Any help would be appreciated!
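One possible way to narrow this down, under the unconfirmed assumption that the damage comes from several workers appending to the same file at the same time, is to let the workers only return their sentences and to write the file once from the main process. The sketch below flattens the three loops into one; `grid`, `labels` and `out_path` are hypothetical names, and the output goes to a temporary file rather than /home/outfile.txt.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(3)
registerDoParallel(cl)

# One row per (hr, fl, month) combination instead of three nested loops.
grid <- expand.grid(month = 1:5, fl = 1:5, hr = 38:180)
labels <- c("severly_low", "low", "medium", "high", "severly_high")

# Workers only build and return strings; none of them touches the file.
sentences <- foreach(i = seq_len(nrow(grid)), .combine = c) %dopar% {
  hr <- grid$hr[i]
  # Same ranges as the if blocks: 38-44, 45-59, 60-100, 101-150, 151-180.
  label <- labels[findInterval(hr, c(45, 60, 101, 151)) + 1]
  paste("About", hr, "soldiers died in the battle",
        paste0("(count being ", label, ")."),
        "Around", grid$fl[i], "soldiers and civilians went missing.",
        "We only have about", sample(38:180, 1),
        "crates which lasts for", grid$month[i], "months as food supply")
}
stopCluster(cl)

# A single write from the main process (hypothetical temporary path).
out_path <- file.path(tempdir(), "outfile_check.txt")
writeLines(sentences, out_path)

# Re-read the file and tokenise it as in the question.
check <- unique(unlist(strsplit(tolower(readLines(out_path)), " ", fixed = TRUE)))
length(check)
```

If the file written this way contains no garbled words, that would point at the concurrent `write.table(..., append = T)` calls rather than at `paste`.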