Remove all instances of duplicates in bibliographic dataset in R

Question

I have two bibliographic datasets, A and B (.bib files, WoS export, full record and cited references). Both contain relevant and irrelevant results. The first dataset, A, has been cleaned, so I have the relevant results A(r) and the irrelevant results A(i) as two separate datasets (.bib files). The second dataset, B, encompasses my first dataset A completely.

Goal: I am looking for a way to remove the irrelevant results A(i), which I have already identified in my first dataset, from my second dataset B.

Approach: If I were to merge the datasets B & A(i) I could trace the irrelevant results A(i) in B by using a remove duplicate function since A(i) would occur twice in B. However, this would only remove the duplicates of A(i) and not all instances of A(i).

Functions to remove duplicates:

package revtools

matches <- find_duplicates(data, match_variable = "title")

data_unique <- extract_unique_references(data, matches)

package bibliometrix

duplicatedMatching(M, Field = "TI", tol = 0.95)

• Q1: Is there a way to remove all instances of duplicates (the duplicates and the originals) identified through a find/remove duplicate function?

• Q2: Is there a better way for removing A(i) from B? i.e. remove all instances of duplicates in a dataset

• Q3: More generally asking: can I search for a larger amount of specific bibliographic data in my dataset (a list of papers) and remove it from that dataset?

Thank you so much for your help!

score 0 · Accepted Answer · answered Dec 04 '19 at 13:00

You can use match to find identical titles in the two data sets.

#remove Ai from B
B[-match(unique(Ai$title), B$title),]
#  title misc
#1     a    X
#2     b    X
#5     e    X
#7     g    X

#remove Ai and Ar from B
B[-match(unique(c(Ai$title, Ar$title)), B$title),]
#  title misc
#7     g    X
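
One caveat: match() returns only the first position of each title in B, and NA for a title that does not occur in B, so the negative-index approach relies on every title in Ai occurring exactly once in B. A minimal sketch of an alternative using %in%, assuming the same column name title as in the example; the toy data here is made up for illustration and is not from the question:

```r
# Hypothetical toy data: B contains "c" twice, and Ai lists "z", which is not in B
B  <- data.frame(title = c("a", "b", "c", "c", "d"), misc = "X", stringsAsFactors = FALSE)
Ai <- data.frame(title = c("c", "z"), misc = "X", stringsAsFactors = FALSE)

# Keep every row of B whose title does not occur in Ai.
# %in% returns one TRUE/FALSE per row of B, so repeated titles are all caught
# and titles missing from B are simply ignored.
B_clean <- B[!(B$title %in% Ai$title), ]
B_clean$title
```

With this logical index both rows titled "c" are dropped, and the unmatched "z" has no effect.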

Data:

Ar <- data.frame(title=c("a", "b", "e"), misc="X", stringsAsFactors = FALSE)
Ai <- data.frame(title=c("d", "c", "f"), misc="X", stringsAsFactors = FALSE)
B <- data.frame(title=letters[1:7], misc="X", stringsAsFactors = FALSE)
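
Regarding Q1 (dropping the duplicates and the originals after merging), a possible base R sketch is to flag duplicates from both directions with duplicated(). It uses the same toy data as above and assumes titles match exactly; the names merged and dup_any are only for illustration:

```r
# Same toy data as in the Data section
Ai <- data.frame(title = c("d", "c", "f"), misc = "X", stringsAsFactors = FALSE)
B  <- data.frame(title = letters[1:7], misc = "X", stringsAsFactors = FALSE)

# Stack both data sets, so every Ai title now occurs at least twice
merged <- rbind(B, Ai)

# duplicated() flags only the later copies; scanning from the end as well
# (fromLast = TRUE) flags the first occurrences too
dup_any <- duplicated(merged$title) | duplicated(merged$title, fromLast = TRUE)

# Drop all flagged rows: the copies and the originals
result <- merged[!dup_any, ]
result$title
```

Note that this also removes any title that was already duplicated within B itself, so it is only equivalent to removing Ai from B when B has no internal duplicates.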
