
I am using R to carry out an analysis of Wikidata dumps. I have previously extracted the variables I need from the XML dumps and created my own dataset as smaller CSV files. Here is how my files look:

Q939818;35199259;2013-05-04T20:28:48Z;KLBot2;/* wbcreateclaim-create:2| */ [[Property:P373]], Tour de Pologne 2010
Q939818;72643278;2013-09-26T03:46:26Z;Coyau;/* wbcreateclaim-create:1| */[[Property:P107]]: [[Q1656682]]
Q939818;72643283;2013-09-26T03:46:28Z;Coyau;/* wbcreateclaim-create:1| */[[Property:P31]]: [[Q2215841]]
Q939818;90117273;2013-11-28T14:14:04Z;DanmicholoBot;/* wbsetlabel-add:1|nb */from the [no] label
Q939818;90117281;2013-11-28T14:14:07Z;DanmicholoBot;/* wbsetlabel-remove:1|no */
Q939818;92928394;2013-11-28T14:14:07Z;DanmicholoBot;/* wbsetlabel-remove:1|no */
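
For completeness, this is roughly how such a file gets read into the wikiedits_clean data frame used below (the filename is a placeholder; V1..V5 are read.table's default column names, and the timestamp in V3 is parsed into a separate time column):

# Rough loading sketch; the filename is just a placeholder.
wikiedits_clean <- read.table("wikiedits_part1.csv", sep = ";",
                              header = FALSE, quote = "", comment.char = "",
                              stringsAsFactors = FALSE)
wikiedits_clean$time <- as.POSIXct(wikiedits_clean$V3,
                                   format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")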

Unfortunately, the script to extract the variables sometimes skips some tags, so in some lines the item ID (the first value) is not present and it is replaced by "wikimedia page".

I would like to infer the missing item IDs by checking the timestamp in the third column: if the time in the line with the missing value is earlier than the time in the following line, then I can assume the item ID is the same (they are two revisions of the same item). Otherwise, the item ID will be the same as in the previous line.

To do that, I wrote some code that first checks for all the lines with "wikimedia page" in the first column and then does what I have just described:

wikimedia_lines <- grep("wikimedia page", wikiedits_clean$V1)

for (i in wikimedia_lines) {
  if (wikiedits_clean$time[i] < wikiedits_clean$time[i + 1]) {
    wikiedits_clean$V1[i] <- wikiedits_clean$V1[i + 1]
  } else {
    wikiedits_clean$V1[i] <- wikiedits_clean$V1[i - 1]
  }
}

However, since my files are quite big (~6.5M lines), executing the loop takes a very long time. Is there a more 'R-style' solution (e.g. using apply or sapply) that could do this more efficiently?

Thank you.

Aliossandro

2 Answers


I suggest the following:

data <- read.table(filename,
                   sep=";",
                   header=FALSE,  # the files have no header row (see the sample above)
                   col.names=c("ID","revision","time","user","comment"),  # only ID and time are used below
                   colClasses=c("character","character","character","character","character"))

data$time <- as.POSIXct(data$time, format="%Y-%m-%dT%H:%M:%S", tz="UTC")

m <- which( data$ID == "wikimedia page" )          # rows with a missing ID
n <- m[which( data$time[m]-data$time[m+1] >= 0 )]  # those whose time is not earlier than the next row's

cleanData <- data

cleanData$ID[n]             <- data$ID[n-1]             # take the ID from the previous row
cleanData$ID[setdiff(m,n)]  <- data$ID[setdiff(m,n)+1]  # take the ID from the next row

"m" is the vector of row numbers where the "ID" is missing. "n" is the vector of those row numbers in "m" where the time is not previous to the time in the next row.

mra68

If there are missing IDs in consecutive rows, my previous solution couldn't fill all the gaps. The following solution is more complicated, but it can handle that case:

data <- read.table(filename,
                   sep=";",
                   header=FALSE,  # the files have no header row (see the sample above)
                   col.names=c("ID","revision","time","user","comment"),  # only ID and time are used below
                   colClasses=c("character","character","character","character","character"))

data$time <- as.POSIXct(data$time, format="%Y-%m-%dT%H:%M:%S", tz="UTC")

m <- sort( which( data$ID == "wikimedia page" ) )  # rows with a missing ID
d <- diff(c(-1,m))             # gap to the previous missing row
e <- diff(c(0,diff(m)==1,0))   # +1 where a run of consecutive rows starts, -1 where it ends

b1 <- c(-Inf, m[which( e>0 | (d>1 & e==0) )], Inf)  # first row of each block, with sentinels
b2 <- c(-Inf, m[which( e<0 | (d>1 & e==0) )], Inf)  # last row of each block, with sentinels

k1 <- b1[unlist(lapply( m, function(x){ which.max(x<b1)-1 }))]  # start of the block containing each m[j]
k2 <- b2[unlist(lapply( m, function(x){ which.max(x<=b2)  }))]  # end of the block containing each m[j]

n1 <- which(((data$time[k2+1]-data$time[m]<0) & k1>1) | k2==nrow(data) )  # inherit from before the block
n2 <- setdiff(1:length(m),n1)                                             # inherit from after the block

cleanData <- data

cleanData$ID[m[n1]] <- data$ID[k1[n1]-1]
cleanData$ID[m[n2]] <- data$ID[k2[n2]+1]

As before, "m" is the vector of row numbers where the ID is missing. The vectors "b1" and "b2" contain those row numbers in "m" where a block of consecutive missing ID's starts and ends, respectively, i.e. the lower bounds and upper bounds of these blocks. So "m" is the union of the intervals "b1[i]:b2[i]" where "i" runs from 1 to the length of "b1" and "b2". Also "k1" and "k2" contain these bounds, but they have the same length as "m" and "m[j]" is contained in the block "k1[j]:k2[j]" for each index "j". The ID in the "m[j]"'s row is set to one of the ID's in the "k1[j]-1"'s row or "k2[j]+1"'s row. The comparison of the time in the "m[j]"'s row with the time in the k2[j]+1"'s row, resulting in the vectors "n1" and "n2", decides which one is chosen.

mra68
  • Hi, thanks for the alternative solution. However, it does not work: I get an error because data$ID[k1[n1]-1] and data$ID[k2[n2]+1] contain 0 as an index (and therefore, when I try your last two commands, the replacement vectors have a different length than cleanData). Furthermore, I actually have blocks of 'wikimedia page' that should not inherit their value from other IDs – and the 'wikimedia page' lines I am interested in occur mostly alone, so this approach would cause some other issues. – Aliossandro Aug 07 '15 at 11:19
  • Perhaps "wikimedia page" appeared in the first or in the last line of the file. I modified the answer such that a missing ID in the first line is replaced by the first non-missing ID, no matter what the time stamps are. And a missing ID in the last line is replaced by the last non-missing ID. – mra68 Aug 07 '15 at 15:46