
I have a large dataset of around 20M observations. I would like to calculate the Jaccard Index between TitleAbstract.x1 and TitleAbstract.y1 in each row.

This is a 2-observation sample:

    structure(list(Patent = c(6326004L, 6514936L), TitleAbstract.x = c("mechanical multiplier purpose speed steering control hydrostatic system invention concerned improvement control system hydrostatic drive vehicle comprising pair hydrostatic pumps output adjustable moving arm attached servo valve controlling displacement said pumps, pump powering respective hydraulic motor drives respective ground engaging means said vehicle. improvement present invention mechanically controls speed steering functions system. comprises pair adjusting means, one communicating pumps, comprising frame adjacent pump, first crank mounted centrally frame, first end first crank drivingly linked arm; second crank mounted centrally frame, first end second crank drivingly linked second end first crank third crank mounted centrally frame, first end third crank drivingly linked second end first crank second end third crank drivingly linked steering linkage means. improved arrangement includes tying means drivingly mounted adjacent second end second cranks linking movement thereof.", 
"mechanical multiplier purpose speed steering control hydrostatic system invention concerned improvement control system hydrostatic drive vehicle comprising pair hydrostatic pumps output adjustable moving arm attached servo valve controlling displacement said pumps, pump powering respective hydraulic motor drives respective ground engaging means said vehicle. improvement present invention mechanically controls speed steering functions system. comprises pair adjusting means, one communicating pumps, comprising frame adjacent pump, first crank mounted centrally frame, first end first crank drivingly linked arm; second crank mounted centrally frame, first end second crank drivingly linked second end first crank third crank mounted centrally frame, first end third crank drivingly linked second end first crank second end third crank drivingly linked steering linkage means. improved arrangement includes tying means drivingly mounted adjacent second end second cranks linking movement thereof."
), cited = c(4261928L, 4261928L), TitleAbstract.y = c("antiviral methods using fragments human rhinovirus receptor (icam-1) ", 
"antiviral methods using human rhinovirus receptor (icam-1) method substantially inhibiting initiation spread infection rhinovirus coxsackie virus host cells expressing major human rhinovirus receptor (icam-1), comprising step contacting virus soluble polypeptide comprising hrv binding site domains ii icam-1; polypeptide capable binding virus reducing infectivity thereof; contact conditions permit virus bind polypeptide."
), Jaccard = c(0, 0.00909090909090909)), row.names = c(NA, -2L
), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x7f9c8f801778>, sorted = "cited", .Names = c("Patent", 
"TitleAbstract.x", "cited", "TitleAbstract.y", "Jaccard"))

Following previous posts, I used a homemade formula to calculate the Jaccard Index and wrapped it in a function to run with mapply, but I get an error: 'this is not a function'.

Jaccard_Index <- function(x,y)
{
  return(mapply(length(intersect(unlist(strsplit(df$TitleAbstract.x1, "\\s+")),unlist(strsplit(df$TitleAbstract.y1, "\\s+")))) / length(union(unlist(strsplit(df$TitleAbstract.x1, "\\s+")),unlist(strsplit(df$TitleAbstract.y1, "\\s+"))))))
}

mapply(Jaccard_Index,df$TitleAbstract.x1,df$TitleAbstract.y1)

I tried replacing TitleAbstract.x1 and TitleAbstract.y1 with x and y, but I still get the same error.

This is probably a novice question, but could anyone help me write the correct function?

Also, I have two more questions:

Q2: How do I use the parallel package and mcmapply to speed up this process?

Q3: What are the limitations of R in terms of memory and speed? Would you recommend a different approach (e.g. calling Python from bash) for long, memory-intensive processes?

Edit

I have uploaded the right dataset. I had to update RStudio so that the dataset would not be truncated.

Amleto

1 Answer

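I simplified your data set a bit. The error comes from the mapply() call inside your function: the first argument of mapply() must be a function, but it was given the number produced by the length ratio. You could use stringdist() from the stringdist package, but its Jaccard method compares character q-grams rather than whole words, so I fixed your Jaccard_Index() instead. This uses mapply(); to parallelize it, replace mapply() with mcmapply() from the parallel package.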

df <- data.frame(
Patent=1:3, 
TitleAbstract.x1=c(
"methods testing oligonucleotide arrays methods testing oligonucleotide",
"isolation cellular material microscopic visualization method microdissection",
"support method determining analyte method producing support method producing"), 
TitleAbstract.y1=c(
"support method determining analyte method producing support method producing",
"method utilizing convex geometry laser capture microdissection process",
"methods testing oligonucleotide arrays methods testing oligonucleotide"),
stringsAsFactors=FALSE)


Jaccard_Index <- function(x, y) {
    if (length(x) == 1) {
        x <- strsplit(x, "\\s+")[[1]]
    }
    if (length(y) == 1) {
        y <- strsplit(y, "\\s+")[[1]]
    }
    length(intersect(x, y)) / length(union(x, y))
}

# Splitting the strings once, outside the loop, appears to be quicker
df$TitleAbstract.x1 <- strsplit(df$TitleAbstract.x1, "\\s+")
df$TitleAbstract.y1 <- strsplit(df$TitleAbstract.y1, "\\s+")

mapply(Jaccard_Index, df$TitleAbstract.x1, df$TitleAbstract.y1, USE.NAMES=FALSE)
# [1] 0.0000000 0.1538462 0.0000000
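
As a rough sketch of the mcmapply() swap mentioned above: the toy vectors x and y and the n_cores value below are made up for illustration and are not part of the original data. mcmapply() relies on forking, so more than one core is only available on Unix-like systems; on Windows the sketch falls back to a single core.

```r
library(parallel)

# Word-level Jaccard index for two character vectors of already-split words
Jaccard_Index <- function(x, y) {
    length(intersect(x, y)) / length(union(x, y))
}

# Hypothetical toy data, split into words once up front
x <- strsplit(c("a b c", "a b", "c d"), "\\s+")
y <- strsplit(c("a b d", "c d", "c d"), "\\s+")

# Assumption: 2 worker processes; forking is not available on Windows
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Same call shape as mapply(), plus mc.cores
res <- mcmapply(Jaccard_Index, x, y, USE.NAMES = FALSE, mc.cores = n_cores)
print(res)
```

Whether this helps on 20M rows depends on the number of cores and on available memory, since each forked worker needs access to the split word lists, so it is worth timing on a subset first.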
AkselA