I am trying to perform an approximate string matching for a data.table containing author names basis a dictionary of "first" names. I have also set a high threshold say above 0.9 to improve the quality of matching.
However, I get an error message given below:
Warning message:
In [`<-.data.table`(x, j = name, value = value) :
Supplied 6 items to be assigned to 17789 items of column 'Gender_Dict' (recycled leaving remainder of 5 items).
This error occurs even if I round the similarity matching down to 4 digits using signif(similarity_score,4).
Some more information about the input data and approach:
- The author_corrected_df is a data.table containing columns: "Author" and "Author_Corrected". Author_Corrected is an alphabet representation of the corresponding Author (Eg: if Author = Jack123, then Author_Corrected = Jack).
- The Author_Corrected column can have variations of a proper first name eg: Jackk instead of Jack, and I would like to populate the corresponding gender in this author_corrected_df called Gender_Dict.
- Another data.table called first_names_dict contains the 'name' (i.e. first name) and gender (0 for female, 1 for male, 2 for ties).
- I would like to find the most relevant match from the "Author_Corrected" per row with respect the the 'name' in first_names_dict and populate the corresponding gender (either one of 0,1,2).
- To make the string matching more stringent, I use a threshold of 0.9720, else later in the code (not shown below), the non-matched values are then represented as NA.
- The first_names_dict and the author_corrected_df can be accessed from the link below: https://wetransfer.com/downloads/6efe42597519495fcd2c52264c40940a20190612130618/0cc87541a9605df0fcc15297c4b18b7d20190612130619/6498a7
for (ijk in 1:nrow(author_corrected_df)){
max_sim1 <- max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")), na.rm = TRUE)
if (signif(max_sim1,4) >= 0.9720){
row_idx1 <- which.max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")))
author_corrected_df$Gender_Dict[ijk] <- first_names_dict$gender[row_idx1]
} else {
next
}
}
While execution I get the following error message:
Warning message:
In `[<-.data.table`(x, j = name, value = value) :
Supplied 6 items to be assigned to 17789 items of column 'Gender_Dict' (recycled leaving remainder of 5 items).
Would appreciate help in terms of knowing where the error lies and if there is a faster way to perform this sort of matching (though the latter one is second priority).
Thanks in advance.