Good afternoon,
I am trying to sort a large corpus of normative texts of different lengths and to tag their parts of speech (POS). For that purpose, and given the size of the corpus, I have been using the tm and udpipe libraries.
The other task I need to perform is named entity recognition. I tried the spacyr library, but it does not correctly identify the names of organizations, so I want to train a custom NER model based on a few documents from the corpus that I have validated by hand.
How could I use spacy_extract_entity() with a custom model? Or maybe with quanteda and spacyr?
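If it helps, this is roughly what I imagine on the R side, assuming spacy_initialize() can take a path to a custom spaCy pipeline saved to disk the same way spacy.load() does in Python (the path and the example sentence below are placeholders, not tested):

library(spacyr)
# placeholder path to a custom spaCy pipeline trained and validated outside R
spacy_initialize(model = "path/to/custom_es_ner")
ejemplo <- "El Ministerio de Salud publicó la resolución."
spacy_extract_entity(ejemplo, output = "data.frame", type = "named")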
Thanks in advance.
This is how I have done the POS tagging, with a couple of helper functions:
suppressMessages(suppressWarnings(library(pdftools)))
suppressMessages(suppressWarnings(library(tidyverse)))
suppressMessages(suppressWarnings(library(tm)))
# load the corpus
tm_corpus <- VCorpus(
  DirSource("working_path", pattern = ".pdf"),  # directory holding the PDF files
  readerControl = list(reader = readPDF, language = "es-419")
)
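A quick sanity check that the PDFs were actually read (plain tm accessors):

length(tm_corpus)        # number of documents loaded from the directory
inspect(tm_corpus[[1]])  # metadata and text of the first document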
# load udpipe
library(udpipe)
dl <- udpipe_download_model(language = "spanish", overwrite = FALSE)
str(dl)
udmodel_spanish <- udpipe_load_model(file = dl$file_model)
# functions to annotate the corpus
f_udpipe_anot <- function(n){
  txt <- as.character(tm_corpus[[n]]) %>%  # flatten the document into a character vector
    unlist()
  y <- udpipe_annotate(udmodel_spanish, x = txt, trace = TRUE)
  as.data.frame(y)
}
pinkillazo <- function(desde, hasta){
  resultado <- data.frame()
  for (item in desde:hasta){
    print(item)  # progress indicator
    resultado <- rbind(resultado, f_udpipe_anot(item))
  }
  return(resultado)
}
leyes_udpipe_POS <- pinkillazo(1,13) # here I got the annotated corpus as a dataframe
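As a side note, I believe udpipe_annotate() also accepts a whole character vector together with a doc_id argument, so the loop may not be strictly necessary; a sketch of what I mean (the names txts and anot are just illustrative):

# extract the plain text of every document once
txts <- vapply(seq_along(tm_corpus),
               function(i) paste(as.character(tm_corpus[[i]]), collapse = "\n"),
               character(1))
anot <- udpipe_annotate(udmodel_spanish, x = txts,
                        doc_id = paste0("doc", seq_along(txts)), trace = TRUE)
leyes_udpipe_POS <- as.data.frame(anot)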
To identify the named entities, I have tried this:
library(quanteda)
library(spacyr)
spacyr::spacy_initialize(model = "es_core_news_sm")
quan_corpus <- corpus(tm_corpus)  # convert the tm corpus to a quanteda corpus
POS_df_spacyr <- spacy_parse(quan_corpus, lemma = FALSE, entity = TRUE, tag = FALSE, pos = TRUE)
organiz <- spacy_extract_entity(
  quan_corpus,
  output = "data.frame",  # a single output format instead of the full default vector
  type = "all",
  multithread = TRUE
)
I am getting the wrong organization names, as well as other misidentified entities. I thought multithreading would at least speed this task up, but that has not been the case.
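For what it is worth, this is how I am narrowing the result down to organizations for manual review, assuming the data.frame output keeps the entity label in an ent_type column (which is what I see on my side):

library(dplyr)
solo_org <- organiz %>%
  filter(ent_type == "ORG") %>%   # keep organization entities only
  count(text, sort = TRUE)        # most frequent organization strings first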