I need to identify all countries mentioned in a text file using the nametagger model. However, the output contains mistakes. For example, it tags Cuba as 'O' instead of 'B-LOC'. It also fails on words that are part of a multi-word country name: 'Kingdom' is not tagged 'B-LOC', and I cannot find a way to use the model with bigram tokens. In short, how can I correctly identify country names that span multiple words, such as United Kingdom? Methods other than the nametagger model are also fine!
Thanks!
Here is the code I tried:
library(nametagger)
library(dplyr)
model <- nametagger_download_model("english-conll-140408", model_dir = tempdir())
predict(model, rename(refugee_df_udp, text = lemma)) %>% filter(!entity %in% c("O", "B-PER")) %>% distinct(term, .keep_all = TRUE)
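Since methods other than nametagger are acceptable, one possible fallback is a plain dictionary lookup against a list of country names, which handles multi-word names such as United Kingdom without needing bigram tokens. The following is only a minimal sketch in base R: the `countries` vector and the `find_countries` helper are hypothetical names made up for illustration, and a real list would have to come from elsewhere (for example a country-name package, which is an assumption and not something used above).

```r
# Hypothetical, hand-written country list for illustration only.
countries <- c("United Kingdom", "United States", "Cuba", "South Africa")

# Hypothetical helper: returns the distinct country names found in `text`.
find_countries <- function(text, names = countries) {
  # Put longer names first so multi-word names are tried before shorter ones.
  names <- names[order(nchar(names), decreasing = TRUE)]
  # Word boundaries keep a name from matching inside a longer word.
  # Note: names containing regex metacharacters would need escaping first.
  pattern <- paste0("\\b(", paste(names, collapse = "|"), ")\\b")
  hits <- regmatches(text, gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE))
  unique(unlist(hits))
}

find_countries("Refugees moved from Cuba to the United Kingdom.")
```

This works on raw text rather than lemmas, so it would be applied to the original sentences and not to the `lemma` column. It only finds names that are in the list, so spelling variants and abbreviations would need to be added by hand.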