I am relatively new to data linkage in general and to the R RecordLinkage package in particular. I have data like the following:
library(RecordLinkage)
library(RCurl)
# Load both data sets and drop the id column so it is not used as a comparison field
dss_member <- read.csv(text = getURL("https://raw.githubusercontent.com/kilimba/data/master/dss_member.csv"),
                       stringsAsFactors = FALSE)
dss_member$id <- NULL
patient <- read.csv(text = getURL("https://raw.githubusercontent.com/kilimba/data/master/patient.csv"),
                    stringsAsFactors = FALSE)
patient$id <- NULL
# Build comparison patterns for every patient/dss_member pair
rpairs <- compare.linkage(patient, dss_member)
rpairs$pairs
# Compute weights for each pair and summarise
rpairs <- epiWeights(rpairs)
summary(rpairs)
As you can see, I have two data frames, dss_member (11 rows) and patient (5 rows). I have inserted a row in both that should, in theory, definitely be a link: the user James Earl Jones. However, I have two concerns.
The line
rpairs$pairs
results in output where the last column, is_match, always shows as NA, even though I am sure that at least one row is identical in both datasets. What does this mean? This is related to another SO question which is yet to be answered.
The lines
rpairs <- epiWeights(rpairs)
summary(rpairs)
give the following result:
Linkage Data Set
5 records in data set 1
11 records in data set 2
55 record pairs
0 matches
0 non-matches
55 pairs with unknown status
Weight distribution:
  [0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]   (0.8,1]
       47         1         3         2         2
(a) Why does it show 0 matches and 0 non-matches when there is definitely at least one match (James Earl Jones)?
(b) Is the identity argument in the function compare.linkage() optional? If so, what happens when you leave it out versus putting it in?
(c) Can one use this package to perform record linkage itself, rather than only record linkage evaluation, even in the absence of a "Gold Standard"?
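To make (b) and (c) concrete, here is a minimal sketch of the two usages as I currently understand them. It uses small toy data frames rather than my real files. The names a, b, id_a, id_b and the threshold of 0.7 are made up for illustration, and I am assuming that identity1/identity2 are the arguments meant for known identities and that epiClassify() is the intended way to classify pairs without them. I may well be wrong about either.

```r
library(RecordLinkage)

# Toy data (hypothetical, not my real files); both frames have the same columns
a <- data.frame(first = c("James", "Mary", "John"),
                last  = c("Jones", "Smith", "Doe"),
                stringsAsFactors = FALSE)
b <- data.frame(first = c("Jon", "James", "Anna", "Mary"),
                last  = c("Doe", "Jones", "Lee", "Smyth"),
                stringsAsFactors = FALSE)

# (b) With identity vectors: one known identifier per record in each data set.
# Records that share an identifier are the true matches (the "Gold Standard").
id_a <- c(1, 2, 3)
id_b <- c(3, 1, 4, 2)
with_id <- compare.linkage(a, b, identity1 = id_a, identity2 = id_b,
                           strcmp = TRUE)
table(with_id$pairs$is_match, useNA = "ifany")

# (c) Without identity vectors: is_match stays NA for every pair, but the pairs
# can still be weighted and then classified with an arbitrarily chosen threshold.
no_id <- compare.linkage(a, b, strcmp = TRUE)
no_id <- epiWeights(no_id)
no_id <- epiClassify(no_id, threshold.upper = 0.7)
table(no_id$prediction)
```

If this is right, the summary in my case shows 0 matches and 0 non-matches simply because I never supplied any known identities, and not because James Earl Jones failed to link. I would appreciate confirmation of that.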
Kind regards, Tumaini