I am trying to perform approximate string matching on a data.table of author names against a dictionary of first names. I have also set a high similarity threshold (above 0.9) to improve the quality of the matches.

However, I get the warning message below:

Warning message:
In `[<-.data.table`(x, j = name, value = value) :
Supplied 6 items to be assigned to 17789 items of column 'Gender_Dict' (recycled leaving remainder of 5 items).

The warning persists even when I round the similarity score to 4 significant digits using signif(similarity_score, 4).

Some more information about the input data and approach:

  1. author_corrected_df is a data.table containing the columns "Author" and "Author_Corrected". Author_Corrected is the alphabetic-only version of the corresponding Author (e.g. if Author = Jack123, then Author_Corrected = Jack).
  2. The Author_Corrected column can contain variations of a proper first name (e.g. Jackk instead of Jack), and I would like to populate the corresponding gender in a column of author_corrected_df called Gender_Dict.
  3. Another data.table called first_names_dict contains the 'name' (i.e. first name) and gender (0 for female, 1 for male, 2 for ties).
  4. For each row, I would like to find the closest match for "Author_Corrected" among the 'name' values in first_names_dict and populate the corresponding gender (one of 0, 1, 2).
  5. To make the string matching more stringent, I use a threshold of 0.9720; values that do not match are set to NA later in the code (not shown below).
  6. The first_names_dict and the author_corrected_df can be accessed from the link below: https://wetransfer.com/downloads/6efe42597519495fcd2c52264c40940a20190612130618/0cc87541a9605df0fcc15297c4b18b7d20190612130619/6498a7
for (ijk in 1:nrow(author_corrected_df)){
  max_sim1 <- max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")), na.rm = TRUE)
  if (signif(max_sim1,4) >= 0.9720){
    row_idx1 <- which.max(stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread")))
    author_corrected_df$Gender_Dict[ijk] <- first_names_dict$gender[row_idx1]
  } else {
    next
  }
}

During execution I get the following warning message:

Warning message:
In `[<-.data.table`(x, j = name, value = value) :
  Supplied 6 items to be assigned to 17789 items of column 'Gender_Dict' (recycled leaving remainder of 5 items).

I would appreciate help finding where the error lies, and knowing whether there is a faster way to perform this sort of matching (the latter is a lower priority).

Thanks in advance.

ds_newbie
  • Hi, I suggest you add `print(max_sim1)` and `print(row_idx1)` just after defining those variables. – cbo Jun 12 '19 at 13:18
  • Hi cbo, I tried adding the print statements for those variables, but I cannot figure out how this is useful. Sample output looks like: 1 114654 1 114654 0.95 0.9333333 0.9333333 0.925 0.9142857 0.93 0.8933333 But I still get the same error as above. – ds_newbie Jun 13 '19 at 06:26
  • This confirms the `6 items to be assigned to 17789`; you want a one-to-one mapping. Check whether you have several maxima by running your code without the loop (e.g. with ijk <- 1). Then check the output of `max_sim1`, `max(stringsim(author_corrected_df$...` , `which.max(stringsim(author_corrected_df$...` , `author_corrected_df$Gender_Dict[ijk]` , `first_names_dict$gender[row_idx1]`. – cbo Jun 13 '19 at 08:18
  • You could check `row_idx1`, print where the problems may be, and take only one value out of all the indices (via a statistic, for example). – cbo Jun 13 '19 at 09:38

1 Answer

Following the previous comments, here I select the most frequent gender in your selection:

for (ijk in seq_len(nrow(author_corrected_df))) {
  # Compute the similarities once and reuse them for both the maximum and its index
  sims <- stringsim(author_corrected_df$Author_Corrected[ijk], first_names_dict$name, method = "jw", p = 0.1, nthread = getOption("sd_num_thread"))
  max_sim1 <- max(sims, na.rm = TRUE)
  if (signif(max_sim1, 4) >= 0.9720) {
    row_idx1 <- which.max(sims)

    # Gender(s) of the selected dictionary row(s), as character
    gender <- as.character(first_names_dict$gender[row_idx1])

    # Take the (first) most frequent gender in the selection
    df_count <- as.data.frame(table(gender))
    ref <- as.character(df_count$gender[which.max(df_count$Freq)])
    value <- unique(gender[gender == ref])

    # Assign a single character value to the data frame
    author_corrected_df$Gender_Dict[ijk] <- value
  }
}
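
One more thing that may be worth checking, although this is an assumption because the code that creates `Gender_Dict` is not shown: if the column does not exist before the loop starts, `author_corrected_df$Gender_Dict` is `NULL`, and assigning to element `ijk` of `NULL` builds a vector of length `ijk` instead of one value per row. That would be consistent with the "6 items" in the warning if the first match happened at row 6. A minimal sketch with toy data (the table `dt` and its values are made up for illustration) that creates the column up front and then writes one cell at a time with `set()` from data.table:

```r
library(data.table)

# Base-R mechanism: assigning to element 6 of NULL yields a length-6 vector
x <- NULL
x[6] <- 1L
length(x)

# Hypothetical stand-in for author_corrected_df
dt <- data.table(Author_Corrected = c("Jack", "Jackk", "Zzz", "Mary", "Mari", "Jon", "Ann"))

# Create the column before the loop, with one NA per row
dt[, Gender_Dict := NA_integer_]

# Write a single cell by reference; no full-column reassignment is involved
set(dt, i = 6L, j = "Gender_Dict", value = 1L)
```

With the column initialised this way, rows that never pass the threshold simply stay `NA`.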

Hope this helps :)

cbo
  • I agree on the point about many matches for the same author in the question. Hence, I have modified my code to store the stringsim results as a data.table with two columns: a running index/counter and the similarity score for a particular author. I then sort this data.table in descending order of similarity score and retain only the first row. My assumption is that this retains only one instance even when multiple records match. But I am still getting the same error. I will check what is not working this time! Thanks cbo for your solution, I will try that too! – ds_newbie Jun 13 '19 at 10:24
  • OK, this is similar to what I do with df_count then. You can also use a statistic other than the maximum by replacing `which.max` in the line that defines `ref`. Cheers! – cbo Jun 13 '19 at 12:40
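
On the secondary question of speed, a loop-free variant might look like the sketch below. It assumes that `amatch()` from the stringdist package fits the use case: for each input string it returns the index of the closest dictionary entry within `maxDist`, or `NA` if there is none. Because the Jaro-Winkler similarity from `stringsim()` is one minus the distance, a similarity threshold of 0.9720 corresponds roughly to `maxDist = 0.028` (the `signif()` rounding in the question may shift borderline cases slightly). The two toy tables are made up for illustration and only reuse the column names from the question.

```r
library(data.table)
library(stringdist)

# Hypothetical miniature versions of the two tables from the question
first_names_dict <- data.table(name = c("jack", "mary", "john"),
                               gender = c(1L, 0L, 1L))
author_corrected_df <- data.table(Author_Corrected = c("jack", "jackk", "zzzz"))

# One call for all authors: index of the closest name, NA when nothing is close enough
idx <- amatch(author_corrected_df$Author_Corrected, first_names_dict$name,
              method = "jw", p = 0.1, maxDist = 0.028)

# Indexing with NA yields NA, so unmatched authors get NA in Gender_Dict
author_corrected_df[, Gender_Dict := first_names_dict$gender[idx]]
```

Since `amatch()` returns exactly one index per author, the result always has one value per row, which also sidesteps the length mismatch reported in the warning.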