I have to analyze scientific papers published in a list of over 20,000 journals. My list has over 450,000 records, but it contains many duplicates (e.g., a paper with authors from different institutions appears more than once).
I need to count the distinct number of papers per journal. The problem is that different authors do not always provide the information in the same way, so I can get something like the following table:
JOURNAL PAPER
0001-1231 A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
0001-1231 A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
0001-1231 THE P3 INFECTION TIME IS W[1]-HARD PARAMETERIZED BY THE TREEWIDTH
0001-1231 THE P3 INFECTION TIME IS W-HARD PARAMETERIZED BY THE TREEWIDTH
0001-1231 COMPOSITIONAL AND LOCAL LIVELOCK ANALYSIS FOR CSP
0001-1231 COMPOSITIONAL AND LOCAL LIVELOCK ANALYSIS FOR CSP
0001-1231 AIDING EXPLORATORY TESTING WITH PRUNED GUI MODELS
0001-1231 DECYCLING WITH A MATCHING
0001-1231 DECYCLING WITH A MATCHING
0001-1231 DECYCLING WITH A MATCHING
0001-1231 DECYCLING WITH A MATCHING.
0001-1231 DECYCLING WITH A MATCHING
0001-1231 ON THE HARDNESS OF FINDING THE GEODETIC NUMBER OF A SUBCUBIC GRAPH
0001-1231 ON THE HARDNESS OF FINDING THE GEODETIC NUMBER OF A SUBCUBIC GRAPH.
0001-1232 DECISION TREE CLASSIFICATION WITH BOUNDED NUMBER OF ERRORS
0001-1232 AN INCREMENTAL LINEAR-TIME LEARNING ALGORITHM FOR THE OPTIMUM-PATH
0001-1232 AN INCREMENTAL LINEAR-TIME LEARNING ALGORITHM FOR THE OPTIMUM-PATH
0001-1232 COOPERATIVE CAPACITATED FACILITY LOCATION GAMES
0001-1232 OPTIMAL SUFFIX SORTING AND LCP ARRAY CONSTRUCTION FOR ALPHABETS
0001-1232 FAST MODULAR REDUCTION AND SQUARING IN GF (2 M )
0001-1232 FAST MODULAR REDUCTION AND SQUARING IN GF (2 M)
0001-1232 ON THE GEODETIC NUMBER OF COMPLEMENTARY PRISMS
0001-1232 DESIGNING MICROTISSUE BIOASSEMBLIES FOR SKELETAL REGENERATION
0001-1232 GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS: ILLEGAL ALLOCATION
0001-1232 GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS: ILLEGAL ALLOCATION
0001-1232 GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS - ILLEGAL ALLOCATION
My goal is to use something like:
library(dplyr)
data %>%
  distinct(JOURNAL, PAPER) %>%          # drop exact duplicate (journal, title) pairs
  group_by(JOURNAL) %>%
  summarise(papers_in_journal = n())    # one row per journal
So, I would have information like:
JOURNAL papers_in_journal
0001-1231 6
0001-1232 8
The problem is that the paper titles contain inconsistencies. Some have a period at the end; some have extra spaces or substituted symbols; some have other minor variations, such as W[1]-HARD versus W-HARD. So, if I run the code as is, what I get is:
JOURNAL papers_in_journal
0001-1231 10
0001-1232 10
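The purely cosmetic differences (trailing periods, extra spaces, punctuation swaps) could presumably be removed by normalizing the titles before calling distinct(). A minimal sketch in base R, where normalize_title is a hypothetical helper name of my own and the sample titles are taken from the table above:

```r
# Hypothetical helper: strip cosmetic differences before comparing titles.
normalize_title <- function(x) {
  x <- toupper(x)
  x <- gsub("[[:punct:]]", "", x)    # drop periods, hyphens, brackets, colons
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  trimws(x)                          # remove leading/trailing spaces
}

titles <- c(
  "DECYCLING WITH A MATCHING",
  "DECYCLING WITH A MATCHING.",
  "FAST MODULAR REDUCTION AND SQUARING IN GF (2 M )",
  "FAST MODULAR REDUCTION AND SQUARING IN GF (2 M)",
  "A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS",
  "A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS"
)

# Number of distinct titles after normalization
n_distinct_titles <- length(unique(normalize_title(titles)))
print(n_distinct_titles)
```

This merges the period, spacing, and hyphen variants, but W[1]-HARD and W-HARD still normalize to different strings (W1HARD versus WHARD), so normalization alone does not seem to be enough.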
My question: is there any way to apply a similarity threshold, either in distinct() or in a similar command, so that I could write something like distinct(JOURNAL, PAPER %within% 0.95)?
That is, I want the command to treat the following titles as equal:
A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
=
A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
THE P3 INFECTION TIME IS W[1]-HARD PARAMETERIZED BY THE TREEWIDTH
=
THE P3 INFECTION TIME IS W-HARD PARAMETERIZED BY THE TREEWIDTH
DECYCLING WITH A MATCHING
=
DECYCLING WITH A MATCHING.
etc.
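To make the intended behaviour concrete, here is a rough brute-force sketch, not something I expect to be the right tool. It assumes that similarity is defined as 1 minus the Levenshtein distance (base R adist()) divided by the length of the longer title, and that titles are grouped by single-linkage clustering. count_similar is a hypothetical helper name, and the small data frame is just a sample from the table above:

```r
# Hypothetical helper: count the titles of ONE journal, treating titles whose
# similarity is at least `threshold` as the same paper.
count_similar <- function(titles, threshold = 0.95) {
  titles <- unique(as.character(titles))
  if (length(titles) < 2) return(length(titles))
  d <- adist(titles)                                    # pairwise Levenshtein distances
  longest <- outer(nchar(titles), nchar(titles), pmax)  # length of the longer title in each pair
  dissim <- d / longest                                 # 0 = identical, 1 = completely different
  # Single-linkage clustering, cut at dissimilarity 1 - threshold
  tree <- hclust(as.dist(dissim), method = "single")
  groups <- cutree(tree, h = 1 - threshold)
  length(unique(groups))
}

data <- data.frame(
  JOURNAL = c(rep("0001-1231", 5), rep("0001-1232", 3)),
  PAPER = c(
    "DECYCLING WITH A MATCHING",
    "DECYCLING WITH A MATCHING.",
    "THE P3 INFECTION TIME IS W[1]-HARD PARAMETERIZED BY THE TREEWIDTH",
    "THE P3 INFECTION TIME IS W-HARD PARAMETERIZED BY THE TREEWIDTH",
    "AIDING EXPLORATORY TESTING WITH PRUNED GUI MODELS",
    "COOPERATIVE CAPACITATED FACILITY LOCATION GAMES",
    "FAST MODULAR REDUCTION AND SQUARING IN GF (2 M )",
    "FAST MODULAR REDUCTION AND SQUARING IN GF (2 M)"
  ),
  stringsAsFactors = FALSE
)

# Apply the helper to each journal separately
papers_in_journal <- tapply(data$PAPER, data$JOURNAL, count_similar)
print(papers_in_journal)
```

It compares every pair of titles within a journal, so I doubt it would scale well to journals with thousands of papers, and single linkage can chain together titles that are only indirectly similar.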
I imagine there is no such simple solution using distinct(), and I was not able to find any alternative command that does this. If it is not possible, I would also appreciate suggestions for a disambiguation algorithm I could use.
Thank you.