I am using R to carry out an analysis of Wikidata dumps. I have previously extracted the variables I need from the XML dumps and created my own dataset of smaller CSV files. Here is how my files look:
Q939818;35199259;2013-05-04T20:28:48Z;KLBot2;/* wbcreateclaim-create:2| */ [[Property:P373]], Tour de Pologne 2010
Q939818;72643278;2013-09-26T03:46:26Z;Coyau;/* wbcreateclaim-create:1| */[[Property:P107]]: [[Q1656682]]
Q939818;72643283;2013-09-26T03:46:28Z;Coyau;/* wbcreateclaim-create:1| */[[Property:P31]]: [[Q2215841]]
Q939818;90117273;2013-11-28T14:14:04Z;DanmicholoBot;/* wbsetlabel-add:1|nb */from the [no] label
Q939818;90117281;2013-11-28T14:14:07Z;DanmicholoBot;/* wbsetlabel-remove:1|no */
Q939818;92928394;2013-11-28T14:14:07Z;DanmicholoBot;/* wbsetlabel-remove:1|no */
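For context, this is roughly how I read a file into wikiedits_clean (the file name and the exact read.table arguments below are just a sketch of my setup, not the real script); the relevant part is that the timestamps in V3 are parsed once into a time column, so later comparisons are on times rather than strings:

# Rough sketch of the loading step (file name is a placeholder).
# The files are semicolon-separated with no header, so the columns come in
# as V1 (item ID), V2 (revision ID), V3 (timestamp), V4 (user), V5 (comment).
wikiedits_clean <- read.table("wikiedits.csv", sep = ";", header = FALSE,
                              quote = "", comment.char = "",
                              stringsAsFactors = FALSE)

# Parse the ISO 8601 timestamps once so they compare as real times.
wikiedits_clean$time <- as.POSIXct(wikiedits_clean$V3,
                                   format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")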
Unfortunately, the script that extracts the variables sometimes skips some tags, so in some lines the item ID (the first value) is missing and is replaced by "wikimedia page".
I would like to infer the missing item IDs by checking the time in the third column: if the time in the line with the missing value is earlier than the time in the following line, then I can assume the item ID is the same as in the following line (they are two revisions of the same item). Otherwise, the item ID is the same as in the previous line.
To do that, I wrote some code that first finds all the lines with "wikimedia page" in the first column and then applies the rule I have just described:
# Rows where the item ID was lost and replaced by "wikimedia page"
wikimedia_lines <- grep("wikimedia page", wikiedits_clean$V1)

for (i in wikimedia_lines) {
  if (wikiedits_clean$time[i] < wikiedits_clean$time[i + 1]) {
    # Earlier than the next revision: same item as the next line
    wikiedits_clean$V1[i] <- wikiedits_clean$V1[i + 1]
  } else {
    # Otherwise it belongs to the same item as the previous line
    wikiedits_clean$V1[i] <- wikiedits_clean$V1[i - 1]
  }
}
However, since my files are quite big (~6.5M lines), the loop takes a very long time to execute. Is there a more 'R-style' solution (vectorised, or using something like apply or sapply) that could do this more efficiently?
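For instance, I imagine something like the following vectorised sketch might be the kind of thing I am after (it assumes the timestamps are already parsed as times and that two consecutive lines never both have a missing ID), but I am not sure it is correct or idiomatic:

# Vectorised sketch of the same rule. Assumes timestamps are already parsed
# and that two consecutive rows never both have a missing item ID.
n <- nrow(wikiedits_clean)
missing <- wikiedits_clean$V1 == "wikimedia page"

# Item ID of the next and previous row (NA at the edges).
next_id <- c(wikiedits_clean$V1[-1], NA)
prev_id <- c(NA, wikiedits_clean$V1[-n])

# Timestamp of the next row, to decide which neighbour to copy from.
next_time <- c(wikiedits_clean$time[-1], NA)
take_next <- !is.na(next_time) & wikiedits_clean$time < next_time

wikiedits_clean$V1[missing] <- ifelse(take_next[missing],
                                      next_id[missing],
                                      prev_id[missing])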
Thank you.